








AN INTRODUCTION TO

PROBABILITY THEORY

AND ITS

APPLICATIONS 



WILEY PUBLICATIONS 
IN STATISTICS 

Walter A. Shewhart, Editor 

Mathematical Statistics 

WALD—Statistical Decision Functions. 
FISHER—Contributions to Mathematical 
Statistics. 

FELLER—An Introduction to Probability
Theory and Its Applications, Volume One.

HOEL—Introduction to Mathematical
Statistics.

WALD—Sequential Analysis. 

Applied Statistics 

HALD—Statistics (in press).

TIPPETT—Technological Applications of 
Statistics. 

DEMING—Some Theory of Sampling. 
COCHRAN and COX—Experimental 
Designs. 

DODGE and ROMIG—Sampling 
Inspection Tables. 

RICE—Control Charts. 

Related Books of Interest to Statisticians 
HAUSER and LEONARD—Government 
Statistics for Business Use. 



AN INTRODUCTION TO

PROBABILITY THEORY 

AND ITS 

APPLICATIONS 


BY 

WILLIAM FELLER 

Eugene Higgins Professor of Mathematics 
Princeton University 


VOLUME ONE 


New York 

JOHN WILEY & SONS, INC. 
CHAPMAN & HALL, LTD. 
LONDON 



Copyright, 1950 

BY 

William Feller 


All Rights Reserved 

This book or any part thereof must not 
be reproduced in any form without 
the written permission of the publisher. 


Reproduction in whole or in part permitted 
for any purpose of the United States Government. 


Copyright, Canada, 1950, International Copyright, 1950 
William Feller, Proprietor 


All Foreign Rights Reserved
Reproduction in whole or in part forbidden. 

SECOND PRINTING, MAY, 1951 


PRINTED IN THE UNITED STATES OF AMERICA 



To 

O. E. NEUGEBAUER 




PREFACE 


It was the author's original intention to write a book on analytical 
methods in probability theory in which the latter was to be treated as 
a topic in pure mathematics. Such a treatment would have been more 
uniform and hence more satisfactory from an aesthetic point of view; 
it would also have been more appealing to pure mathematicians. However,
the generous support by the Office of Naval Research of work in
probability theory at Cornell University led the author to a more 
ambitious and less thankful undertaking of satisfying heterogeneous 
needs. 

It is the purpose of this book to treat probability theory as a self-contained
mathematical subject rigorously, avoiding non-mathematical
concepts. At the same time, the book tries to describe the empirical 
background and to develop a feeling for the great variety of practical 
applications. This purpose is served by many special problems, numerical
estimates, and examples which interrupt the main flow of the
text. They are clearly set apart in print and are treated in a more 
picturesque language and with less formality. A number of special 
topics have been included in order to exhibit the power of general 
methods and to increase the usefulness of the book to specialists in 
various fields. To facilitate reading, detours from the main path are 
indicated by stars. The knowledge of starred sections is not assumed 
in the remainder. 

A serious attempt has been made to unify methods. The specialist 
will find many simplifications of existing proofs and also new results. 
In particular, the theory of recurrent events has been developed for the 
purpose of this book. It leads to a new treatment of Markov chains 
which permits simplification even in the finite case. 

The examples are accompanied by about 340 problems mostly with 
complete solutions. Some of them are simple exercises, but most of 
them serve as additional illustrative material to the text or contain 
various complements. One purpose of the examples and problems is 
to develop the reader's intuition and art of probabilistic formulation. 
Several previously treated examples show that apparently difficult 
problems may become almost trite once they are formulated in a 
natural way and put into the proper context. 

There is a tendency in teaching to reduce probability problems to 
pure analysis as soon as possible and to forget the specific characteristics
of probability theory itself. Such treatments are based on a
poorly defined notion of random variables usually introduced at the 
outset. This book goes to the other extreme and dwells on the notion 
of sample space, without which random variables remain an artifice. 

In order to present the true background unhampered by measurability
questions and other purely analytic difficulties this volume is
restricted to discrete sample spaces. This restriction is severe, but 
should be welcome to non-mathematical users. It permits the inclusion 
of special topics which are not easily accessible in the literature. At 
the same time, this arrangement makes it possible to begin in an elementary
way and yet to include a fairly exhaustive treatment of such
advanced topics as random walks and Markov chains. The general 
theory of random variables and their distributions, limit theorems, 
diffusion theory, etc., is deferred to a succeeding volume. 

This book would not have been written without the support of the 
Office of Naval Research. One consequence of this support was a 
fairly regular personal contact with J. L. Doob, whose constant criticism
and encouragement were invaluable. To him go my foremost
thanks. The next thanks for help are due to John Riordan, who 
followed the manuscript through two versions. Numerous corrections 
and improvements were suggested by my wife who read both the manuscript
and proof.

The author is also indebted to K. L. Chung, M. Donsker, and 
S. Goldberg, who read the manuscript and corrected various mistakes; 
the solutions to the majority of the problems were prepared by S. 
Goldberg. Finally, thanks are due to Kathryn Hollenbach for patient 
and expert typing help; to E. Elyash, W. Hoffman, and J. R. Kinney 
for help in proofreading. 

William Feller 

Cornell University 
January 1950 



CONTENTS 


INTRODUCTION: THE NATURE OF PROBABILITY THEORY
1. The Background
2. Procedure
3. “Statistical” Probability
4. Historical Note

CHAPTER 1 THE SAMPLE SPACE
1. The Empirical Background
2. Illustrative Examples
3. The Sample Space. Events
4. Relations among Events
5. Discrete Sample Spaces
6. Probabilities in Discrete Sample Spaces
7. Problems for Solution

CHAPTER 2 ELEMENTS OF COMBINATORIAL ANALYSIS. STIRLING’S FORMULA
1. Preliminaries
2. Samples
3. Examples
4. Partitions
5. The Hypergeometric Distribution
6. Binomial Coefficients
7. Stirling’s Formula
8. Problems for Solution: Combinatorial
9. Problems for Solution: Binomial Coefficients and Stirling’s Formula

CHAPTER 3 THE SIMPLEST OCCUPANCY AND ORDERING PROBLEMS
1. Combinatorial Lemmas
2. Bose-Einstein and Fermi-Dirac Statistics
3. The Classical Occupancy Problem
4. Runs
5. Problems for Solution

CHAPTER 4 COMBINATION OF EVENTS
1. Union of Events
2. Examples
3. The Realization of m among N Events
4. Application to Matching and Guessing
5. Application to the Classical Occupancy Problem
6. Miscellany
7. Problems for Solution

CHAPTER 5 CONDITIONAL PROBABILITY. STATISTICAL INDEPENDENCE
1. Conditional Probability
2. Compound Experiments
3. Statistical Independence
4. Repeated Trials
4a. A Guide to Abstract Language
5. Applications to Genetics
6. Sex-linked Characters
7. Selection
8. Problems for Solution

CHAPTER 6 THE BINOMIAL AND THE POISSON DISTRIBUTIONS
1. Bernoulli Trials
2. The Binomial Distribution
3. The Central Term
4. The Poisson Approximation
5. The Poisson Distribution
6. Observations Fitting the Poisson Distribution
7. The Multinomial Distribution
8. Problems for Solution

CHAPTER 7 THE NORMAL APPROXIMATION TO THE BINOMIAL DISTRIBUTION
1. The Normal Distribution
2. The DeMoivre-Laplace Limit Theorem
3. The Law of Large Numbers
4. Relation to the Poisson Approximation
5. Large Deviations
6. Problems for Solution

CHAPTER 8 UNLIMITED SEQUENCES OF BERNOULLI TRIALS
1. Infinite Sequences of Trials
2. Systems of Gambling
3. The Borel-Cantelli Lemmas
4. The Strong Law of Large Numbers
5. The Law of the Iterated Logarithm
6. Interpretation in Number Theory Language
7. Problems for Solution

CHAPTER 9 RANDOM VARIABLES; EXPECTATION
1. Random Variables
2. Expectations
3. Examples and Applications
4. The Variance
5. Covariance; Variance of a Sum
6. Chebyshev’s Inequality
7. Kolmogorov’s Inequality
8. The Correlation Coefficient
9. Problems for Solution

CHAPTER 10 LAWS OF LARGE NUMBERS
1. Identically Distributed Variables
2. Proof of the Law of Large Numbers
3. The Theory of “Fair” Games
4. The Petersburg Game
5. Variable Distributions
6. Applications to Combinatorial Analysis
7. The Strong Law of Large Numbers
8. Problems for Solution

CHAPTER 11 INTEGRAL VALUED VARIABLES. GENERATING FUNCTIONS
1. Generalities
2. Convolutions
3. The Geometric and the Pascal Distributions
4. Relation to Holding or Waiting Times
5. Compound Distributions
6. Chain Reactions
7. Partial Fraction Expansions
8. The Continuity Theorem
9. Problems for Solution

CHAPTER 12 RECURRENT EVENTS: THEORY
1. Definition
2. Recurrence Times
3. Fundamental Theorems
4. Application of the Central Limit Theorem
5. Fluctuations in the Coin-tossing Game; the Arc Sine Law
6. Proof of the Theorems of Section 5
7. Proof of Theorem 3 of Section 3
8. Problems for Solution

CHAPTER 13 RECURRENT EVENTS: APPLICATIONS TO RUNS AND RENEWAL THEORY
1. Success Runs
2. More General Patterns
3. Numerical Estimates
4. The Renewal Equation
5. Examples
6. Problems for Solution

CHAPTER 14 RANDOM WALK AND RUIN PROBLEMS
1. General Orientation
2. The Gambler’s Ruin
3. Expected Duration of the Game
4. Generating Functions for the Duration of the Game and First Passage Times
5. Explicit Expressions
6. Passage to the Limit; Diffusion Processes
7. Random Walks in the Plane and Space
8. The Generalized One-dimensional Random Walk (Sequential Sampling)
9. Problems for Solution

CHAPTER 15 MARKOV CHAINS
1. Definition
2. Illustrative Examples
3. Higher Transition Probabilities
4. Irreducible Chains
5. Classification of States
6. Ergodic Properties of Aperiodic Chains; Stationary Distributions
7. Periodic Chains
8. Transient States; Absorption Probabilities
9. Application to Card Shuffling
10. The General Markov Process
11. Miscellany
12. Problems for Solution

CHAPTER 16 ALGEBRAIC TREATMENT OF FINITE MARKOV CHAINS
1. General Theory
2. Examples
3. Random Walk with Reflecting Barriers
4. Transient States; Absorption Probabilities
5. Application to Recurrence Times

CHAPTER 17 THE SIMPLEST TIME-DEPENDENT STOCHASTIC PROCESSES
1. General Orientation
2. The Poisson Process
3. The Pure Birth Process
4. Divergent Birth Processes
5. The Birth and Death Process
6. Exponential Holding Times
7. Waiting Line and Servicing Problems
8. The Backward (Retrospective) Equations
9. Generalization; The Kolmogorov Equations
10. Degenerate Processes
11. Problems for Solution

ANSWERS TO PROBLEMS

INDEX





































INTRODUCTION 


THE NATURE OF PROBABILITY THEORY 
1. The Background 

Probability is a mathematical discipline with aims akin to those 
of, for example, geometry or analytical mechanics. In each discipline, 
we must be careful to distinguish three aspects of the theory: (a) the 
formal logical content, (b) the intuitive background, (c) applications.

(a) Formal Logical Contents. A salient feature of mathematics is 
that it is concerned solely with relations among undefined things. This 
point is well illustrated by the game of chess. It is impossible to 
“define” chess otherwise than by stating a set of rules. The conventional
shape of the pieces may be described to some extent, but it will
not always be obvious to the inexperienced player which piece is intended
for “king.” The important thing is to know how the pieces
move and act, that is, to know the rules. As a matter of fact, chess can be
played without pieces and without chessboard. It is played in writing,
with the sixty-four fields represented by as many symbols (A, 1), ...,
(H, 8) in much the same way as analytical geometry describes geometrical
points by their coordinates.

As it does not make sense in chess to ask what the “true nature” of 
a pawn or king is, so geometry does not care what a point and a straight 
line “really are.” They remain undefined notions, and the axioms of 
geometry specify the relations among them: two points determine a
line, etc. These are the rules of the game. The mathematician studies
several non-Euclidean geometries in the same way as a chess player 
may play different variants of the chess game. The various geometries 
can be studied independently of their relations to reality, and, similarly, 
it is possible in mechanics to study how bodies would move if Newton’s 
law of attraction were replaced by another one.

(b) The Intuitive Background. An essential difference between chess
and geometry is that the rules of chess are arbitrary, whereas the
axioms of geometry refer in an obvious manner to an intuitive background.
In fact, geometrical intuition is so strong that it is likely to
run ahead of logical reasoning. The extent to which logic, intuition, 
and physical experience are interdependent is a difficult problem of
philosophy into which we need not enter. It is certain that intuition
can be trained and developed. In chess, the bewildered beginner 
moves cautiously, recalling the individual rules, whereas the experienced
player absorbs a complicated situation at a glance and is often
unable to account rationally for his intuition. Similarly, it is possible 
to develop an intuitive feeling for relations, say, in a four-dimensional 
space. Further, the collective intuition of mankind appears to make 
rapid progress. Newton’s notion of a field of force and of action at a 
distance, and Maxwell’s concept of electromagnetic waves travelling 
through space, were at first described as “unthinkable” and contrary 
to intuition. With the popularization of mechanics and radio an education
in the history of ideas is now required to understand why those
theories originally seemed strange and unacceptable. When the theory 
of probability was new, it had to struggle against prejudiced intuitions 
and types of reasoning which are no longer cultivated, so that it is 
hard for us to understand the initial difficulties. Nowadays small 
boys are betting and shooting dice, newspapers report on samples of 
public opinion, and the magic of statistics embraces all phases of life 
to the extent that young girls anxiously watch the statistics of their 
chances to get married. Thus everyone has acquired an intuitive 
feeling for the meaning of statements like “the chances for this event 
are three in five.” This intuition suffices as a background for the 
first few formal rules of probability. It will be trained and developed 
as the theory progresses and acquaintance is made with a variety of 
more sophisticated applications. 

(c) Applications. In applications of geometry and mechanics theoretical
concepts are identified with certain physical objects, but the
method is flexible and varies from occasion to occasion so that no
general rules can be given. The concept of a rigid body is useful and
essential to mechanics, and yet no physical objects meet the specifications.
Only experience teaches us which bodies can, with a satisfactory
approximation, be treated as rigid. Rubber is usually given as a 
typical example of a non-rigid body, but in discussing the motion of 
automobiles most textbooks treat the wheels, including rubber tires, 
as rigid bodies. This is an example of how theoretical models are
chosen and varied according to convenience and needs. Depending on
our purposes we feel free to disregard atomic theories and treat the sun
as a tremendous ball of continuous matter or, on another occasion, as a
single mass point. We must always remember that mathematics deals
with abstract models and that different models can describe the same
empirical situation with various degrees of approximation and simplicity.
The manner in which mathematical theories are applied does
not depend on preconceived ideas and is not a matter of logic: it is a
purposeful technique which depends on, and changes with, experience. 
Of course, every phase of human activities is of interest to the philosopher,
and a philosophical analysis of applications of mathematics is a
legitimate study. However, such an analysis is not within the realm 
of mathematics, physics, or statistics. A philosophical discussion of 
the foundations of probability must be divorced from the mathematical 
theory and its applications to the same extent as a discussion of our 
intuitive space concept is now divorced from geometry (though this 
has not always been so). 

2. Procedure 

The history of probability (and of mathematics in general) shows a 
stimulating interplay of theory and applications: progress in theory 
opens new fields of applications, and each new application creates new 
theoretical problems and influences the direction of research. Today 
applications of the theory of probability extend over many fields of 
different natures, and the number of applications is steadily increasing. 
Only a general mathematical theory is flexible enough to provide proper 
tools for such a variety of problems, and we must withstand the temptation
to keep our notions, pictures, and terms too close to one particular
field of experience. We require a rigorous mathematical theory proceeding
along the lines which are generally accepted in geometry and
mechanics. 

We shall start from the simplest experiences such as tossing a coin 
or throwing dice, where all statements have an obvious intuitive meaning.
This intuition will be translated into an abstract model which
will then be generalized to take care of more complicated situations. 
Illustrative examples should explain the empirical background of the 
theories and develop the reader’s intuition, but the theory itself will 
have a mathematical character. We shall no more attempt to explain 
the “true meaning” of probability than the modern physicist dwells 
on the real meaning of mass and energy or the geometer explains the 
nature of a point. Instead, we shall prove exact theorems and show 
how they are applied in actual practice. 

Originally, the theory of probability was developed to describe a
very limited set of experiences connected with games of chance. The
main objective of the theory in this connection was the calculation of
certain probabilities. We shall follow the historical path and start
from games of chance. We do so not for their importance or general
interest, but because games of chance provide the most intuitive background
and because, for the benefit of non-mathematical readers, we
wish to postpone the use of special analytical tools as long as possible.
Accordingly, in the first few chapters we shall calculate a few typical 
probabilities, but the general theory is independent of particular numerical
values. Its object is to discover general laws and relations and to
construct abstract models which can to some extent describe physical 
facts. Probabilities play for us the same role as masses in mechanics: 
it is possible to discuss the motion of the planetary system without 
knowing the individual masses and without contemplating methods for 
their actual measurement; it is also possible to study the hypothetical 
motion of a non-existent planetary system. We are usually interested 
in general laws and rarely in specific numerical computations. Useful
probability models often refer to non-existing worlds. Thus the
systems used in automatic telephone exchanges are based on simple
probability considerations, and billions of dollars have been invested
on this basis. It is clear that the underlying theory must compare 
various potential systems of exchanges, the majority of which will 
never exist, since the theory proves them to be inferior. Similarly, in 
insurance, probability theory is used to avoid certain undesirable situations
which, as a consequence, can never be observed. When actual
observations and numerical estimates are desired, it is usually necessary 
to use refined methods which form a chapter of mathematical statistics 
and lead beyond probability theory in the proper sense of the word. 

3. “Statistical” Probability 

The success of the modern mathematical theory of probability is 
bought at a price: the theory is limited to one particular aspect of the 
matter. The intuitive notion of probability is connected with the 
general inductive reasoning and with judgments such as “Paul is 
probably a happy man,” “probably this book will be a failure,”
“Fermat’s conjecture is probably false.” Probability judgments of this
sort are of interest to the philosopher and the logician, and they are
also a legitimate object of a mathematical theory.¹ It must be understood,
however, that we are concerned not with modes of inductive
reasoning but with something that might be called physical or statistical
probability. In a rough way we may characterize this concept by
saying that our probabilities do not refer to judgments but to possible
outcomes of a conceptual experiment. Before we speak of probabilities,
we must agree on an idealized model of a particular conceptual experiment
such as tossing a coin, sampling kangaroos on the moon, observing
a particle under diffusion, counting the number of telephone calls. At
the outset we must agree on the possible outcomes of this experiment
(our sample space) and our probabilities will be associated with these
and nothing else. This is in analogy with the procedure in mechanics
where fictitious models involving two, three, or seventeen mass points
are introduced, and these points are devoid of individual properties.
Similarly, if we agree to analyze the coin-tossing game, then we speak
of sequences like “head, head, tail, head, tail, ...” and nothing else.
There is no place in our system for speculations concerning the probability
that the sun will rise tomorrow. Before speaking of such a
probability we would have to agree on an (idealized) model of an experiment,
and this would presumably run along the lines “out of infinitely
many worlds one is selected at random ....” Little imagination is
required to construct such a model, but it appears both uninteresting
and meaningless.

¹ A modern axiomatic system was given by B. O. Koopman, The Axioms and
Algebra of Intuitive Probability, Annals of Mathematics (2), vol. 41 (1940), pp.
269-292, and The Bases of Probability, Bulletin of the American Mathematical
Society, vol. 46 (1940), pp. 763-774.
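
As a small illustration of agreeing on the possible outcomes at the outset, the following Python sketch lists the sample space for a fixed number of coin tosses and attaches a probability to each outcome. The function names and the equal weights (a fair coin) are assumptions made for the illustration only; the text fixes the outcomes, not their probabilities.

```python
from fractions import Fraction
from itertools import product


def coin_sample_space(n):
    """Return every sequence of n tosses, written as a string of 'H' and 'T'."""
    return ["".join(seq) for seq in product("HT", repeat=n)]


def uniform_probabilities(space):
    """Assumption for illustration: a fair coin, so all sequences are equally likely."""
    weight = Fraction(1, len(space))
    return {outcome: weight for outcome in space}


space = coin_sample_space(3)
probs = uniform_probabilities(space)
print(space)                # the possible outcomes of three tosses
print(sum(probs.values()))  # the probabilities add up to 1
```

Nothing outside the listed sequences receives a probability, which is the point of fixing the sample space first.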

The astronomer speaks of measuring the temperature at the center 
of the sun or of travel to Sirius. These operations seem impossible, 
and yet it is not senseless to contemplate them. By the same token, 
we shall not worry whether or not our conceptual experiments can be 
performed; we shall analyze abstract models. In the back of our minds 
we keep an intuitive interpretation of probability which gains operational
meaning in certain applications. We imagine the experiment
performed a great many times. An event with probability 0.6 should 
be expected, in the long run, to occur sixty times out of a hundred. 
This description is deliberately vague but supplies a picturesque 
intuitive background sufficient for the more elementary applications. 
It can be rendered more precise only by a more elaborate theory. It 
does by no means describe typical applications. In fact, experiments 
to which probability theory is usefully applied are not necessarily 
repeated “a great many times under identical conditions.” The agricultural
experimenter has only few replicas, and frequently he refuses
to consider their conditions as identical (because of trends in fertility, 
etc.). The quality-control engineer applies probability theory to one 
particular machine and draws inferences (or guesses) concerning a set 
of circumstances which will never repeat itself. The telephone engineer 
applies probability theory in order to compare various trunking systems
of which ultimately only one comes into existence.
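
The long-run picture mentioned above, in which an event with probability 0.6 occurs about sixty times out of a hundred, can be tried out with a short simulation. This is only a sketch: the function name, the fixed seed, and the use of Python's pseudo-random generator as a stand-in for "the experiment performed a great many times" are all assumptions of the illustration.

```python
import random


def relative_frequency(p, trials, seed=0):
    """Fraction of `trials` simulated experiments in which an event of probability p occurs."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    hits = sum(1 for _ in range(trials) if rng.random() < p)
    return hits / trials


# The observed frequency fluctuates for small samples and settles near 0.6 for large ones.
for n in (100, 10_000, 100_000):
    print(n, relative_frequency(0.6, n))
```

As the text stresses, this is an intuitive background only; many real applications do not involve repetition under identical conditions.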

The truth is that, like all mathematics, the theory of probability
builds theoretical models which are applied in many and variable ways.
The technique of applications can be understood only after the theory.
The intuition develops with the theory. Our statistical description of
probability suffices as an intuitive background for the beginning. It
is vague, but, to use a simile of T. E. Lawrence, it is “like the bow of
the Mauretania: the bow has so much weight behind it that it does
not need to be sharp like a razor.”

4. Historical Note 

Measure theory is a rather new branch of mathematics, but without 
its concepts and tools it would be impossible to treat probability theory 
along strictly mathematical lines. Until quite recently it was therefore 
necessary to admit non-mathematical elements, and most textbooks 
on probability are reminiscent of older books on mechanics in that 
they contain much philosophy and many attempts to define actual 
things rather than relations. 

The advent of modern statistics put new requirements on probability
theory, and certain classical concepts proved inadequate. Great
progress was achieved when R. A. Fisher and R. von Mises developed
the notion of statistical probability with a definite operational meaning.
To von Mises is due the notion of a sample space² representing all
conceivable outcomes of an experiment. This notion filled a gap in
the measure theoretical approach which was gradually emerging in
the twenties under the influence of many authors (among them prominently
Rademacher and Steinhaus). A complete axiomatic treatment
of the foundations of probability theory was given by A. Kolmogorov,³
who emphasized the influence of von Mises’ ideas. Conceptually we
shall follow this line, although in the first volume we deal only with
discrete probabilities where all relations are so simple that the term
“measure theory” appears too solemn.

² Cf. his book, Wahrscheinlichkeitsrechnung, Leipzig and Wien, 1931, with references
to his original papers dating back to about 1921. The German word is
Merkmalraum (or label space).

³ A. Kolmogoroff, Grundbegriffe der Wahrscheinlichkeitsrechnung, fasc. 3 of vol. 2
of Ergebnisse der Mathematik, Berlin, 1933.

An unfortunate publicity was given to discussions of the so-called
foundations of probability, and thus the erroneous impression was
created that essential disagreement can exist among mathematicians.
Actually these discussions concern only minor points which are of
interest to but few specialists. The formulation of certain properties
of randomness ranks high among von Mises’ contributions to probability
theory. To von Mises these properties served as axioms from
which he undertook to develop the whole theory of probability. This
enterprise was illuminating and served its historical purpose, but the
theory encountered unexpected difficulties which completely destroyed
its original simplicity. On the other hand, in the measure theoretical
approach von Mises’ randomness properties can be proved with surprising
ease. In this respect there exists no disagreement concerning
mathematical facts, and the argument revolves about the question of
whether certain statements should occupy the place of axioms or of
theorems.

More serious is the contention that the modern approach is too 
general and embraces subjects which should be kept outside. It is 
true that the same objection is raised against other mathematical fields, 
but in probability theory the criticism was aimed specifically against 
the study of infinitely prolonged games. The so-called St. Petersburg 
paradox of the classical theory is supposed to show that a rational 
theory must exclude such cases and that only special sample spaces 
should be permitted. Now it turns out that certain physical problems 
connected with absorption probabilities and recurrence times are 
exactly of the St. Petersburg type, and the suggested limitation of the 
theory of probability would seriously reduce its usefulness. Actually, 
the measure theoretical approach leads to no paradoxes or difficulties, 
and its greatest advantage is that it substitutes theorems for vague 
discussions of paradoxes. It is easy to condemn and to decry theories 
as impractical. The foundations of practical things of today were so 
decried only yesterday, and the theories which will be practical tomorrow
are branded as valueless abstract games by the practical men
of today. 



CHAPTER 1 


THE SAMPLE SPACE 
1. The Empirical Background

The mathematical theory of probability gains practical value and
an intuitive meaning in connection with real or conceptual experiments
or phenomena such as tossing a coin once, tossing a coin 100 times,
throwing a die, throwing three dice, arranging a deck of cards, matching
two decks of cards, playing craps, playing roulette, observing the length
of life of a radioactive atom or of a person, selecting a random sample
of people and observing the number of left-handers in it, the sex of a
newborn baby, crossing two species of plants and observing the phenotypes
of the offspring, the number of busy trunklines in a telephone
exchange, the number of calls on a telephone, random noise in an electrical
communication system, routine quality control of a production
process, frequency of accidents, the number of double stars in a region
of the skies, the position of a particle under diffusion. All these descriptions
are rather vague, and, in order to render the theory meaningful,
we have to agree on what we mean by possible results of the experiment
or observation in question.

A coin does not necessarily fall heads or tails; it can roll away or
stand on its edge. Nevertheless, we shall agree to regard “head” and
“tail” as the only possible events following the tossing of a coin. This
convention simplifies the theory without affecting its applicability. 
Similar idealizations are frequently unavoidable. It is impossible to 
measure the length of life of an atom or of a person without some 
error, but for theoretical purposes it is expedient to imagine that these 
quantities are exact numbers. The question then arises as to which 
numbers can actually represent the life span of a person. Is there a 
maximal age beyond which life is impossible, or is any age conceivable? 
We hesitate to admit that man can grow 1000 years old, and yet cur¬ 
rent actuarial practice admits no such limit to life. According to 
formulas on which modern mortality tables are based, the proportion 
of men surviving 1000 years is of the order of magnitude of one in 
lO 10 * 6 —a number with 10 27 billions of digits. This statement does not 
make sense from a biological or sociological point of view, but con- 

8 



1.1] 


THE EMPIRICAL BACKGROUND 


9 


sidered exclusively from a statistical standpoint it certainly does not 
contradict any experience. There are fewer than 10 10 people born in a 
century. To test the contention statistically, more than 10 1038 centuries 
would be required, which is considerably more than lO 10 * 4 lifetimes of 
the earth. Obviously, such extremely small probabilities are com¬ 
patible with our notion of impossibility. One might think that their 
use is utterly absurd. Actually, it does no harm and is convenient in 
simplifying many formulas. Moreover, if we were seriously to discard 
the possibility of living 1000 years, we would be confronted with even 
worse difficulties, since we would have to accept the existence of a 
maximum age. However, the assumption that it should be possible 
to live x years and impossible to live x years and 2 seconds is as un¬ 
appealing as the idea of unlimited life. 

Any theory necessarily involves idealization, and usually the latter 
is so natural that it goes without saying. Our first idealization concerns
the possible outcomes of an “experiment” or “observation.”
Only these possible outcomes are objects of the mathematical theory. 
If we want to construct an abstract model of the experiment, we must 
at the outset reach a decision as to what constitutes a possible outcome 
of the (idealized) experiment. 

For uniform terminology, the results of experiments or observations
will be called events. Thus we shall speak of the event that of 5 coins
tossed more than three fell heads. Similarly, the “experiment” of
distributing the cards in bridge¹ may result in the “event” that North
has two aces. The composition of a sample (“two left-handers in a
sample of 85”) and the result of a measurement (“temperature 120°,”
“seven trunklines busy”) will each be called an event. Now a single
observation may refer simultaneously to several events. Thus, if a
throw with 2 dice resulted in the event “3 and 3,” then the same trial
resulted also in the following events: “two odd faces,” “sum 6,” “no
ace,” etc. These events are not mutually exclusive, so that several
of them can occur simultaneously. They are compound events in the
sense that they can be further decomposed into simple events. “Sum
6” means “(1, 5), or (2, 4), or (3, 3), or (4, 2), or (5, 1),” while “two
odd faces” is an abbreviation for “(1, 1), or (1, 3), or (1, 5), or (3, 1),
or ....” In poker, every individual hand of 5 cards constitutes a
simple (indecomposable) event; there are 2,598,960 such simple events
or hands. The event “a hand contains two aces and no more” is a
compound event and could be described by an enumeration of the
103,776 hands containing exactly two aces. Every particular numerical
value x for a temperature represents a simple event, while the statement
“the temperature is in the fifties” describes the compound event
50 < x < 60. Every compound event can be decomposed into simple
events, that is to say, a compound event is an aggregate of certain simple
events.

¹ Definition of bridge and poker. A deck of bridge cards consists of 52 cards
arranged in four suits of 13 each. There are thirteen face values (2, 3, ..., 10,
jack, queen, king, ace) in each suit. The four suits are called spades, clubs, hearts,
diamonds. The last two are red, the first two black. Cards of the same face
value are called of the same kind. For our purposes, playing bridge means
distributing the cards to four players, to be called North, South, East, and West (or
N, S, E, W, for short) so that each receives 13 cards. Playing poker, by definition,
means selecting five cards out of the pack. We shall also study the composition of
a hand of r bridge cards without reference to any particular game.
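
The decomposition of compound events into simple ones can be checked by direct enumeration. The following is a minimal sketch in Python (our choice of language; the variable names are illustrative and not part of the text). It lists the simple events making up “sum 6” and “two odd faces,” and recomputes the two poker counts quoted above from binomial coefficients.

```python
from itertools import product
from math import comb

# Simple events for a throw with 2 dice: ordered pairs of faces.
throws = list(product(range(1, 7), repeat=2))

# "Sum 6" as an aggregate of simple events.
sum_six = [(i, j) for i, j in throws if i + j == 6]

# "Two odd faces": both dice show an odd number.
two_odd = [(i, j) for i, j in throws if i % 2 == 1 and j % 2 == 1]

# Poker hands: 5 cards chosen out of 52, order disregarded.
poker_hands = comb(52, 5)

# Hands with exactly two aces: 2 of the 4 aces and 3 of the 48 other cards.
two_ace_hands = comb(4, 2) * comb(48, 3)

print(sum_six)
print(len(two_odd), poker_hands, two_ace_hands)
```

The event “two odd faces” turns out to contain nine simple events, of which the text lists only the first four.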

If we want to speak about “experiments” or “observations” in a
theoretical way and without ambiguity, we must first agree on the
simple (not further decomposable) events representing the thinkable
outcomes; they define the idealized experiment. It is usual to refer
to these simple events as sample points, or points for short. By definition,
every indecomposable result of the (idealized) experiment is represented
by one, and only one, sample point. The aggregate of all sample
points will be called the sample space. All events connected with a
given (idealized) experiment can be described in terms of sample
points.

2. Illustrative Examples 

(a) Sampling. Suppose that a sample of 100 people is taken in
order to estimate how many people smoke. The only property of the
sample which is of interest in this connection is the number x of
smokers; this may be any integer between 0 and 100. In this case we
may agree that our sample space consists of the 101 “points” x = 0,
1, ..., 100. The result of every particular sample or observation is
completely described by stating the corresponding point x. An example
of a compound event is the result that “the majority of the people
sampled are smokers.” This means that the experiment resulted in one
of the fifty events x = 51, 52, ..., 100, but it is not stated in which.
Similarly, every property of the sample can be described in enumerating
the corresponding cases or sample points. For uniform terminology
we speak of events rather than properties of the sample. Mathematically,
an event is simply the aggregate of the corresponding sample
points.

(b) Sampling (Continued). Suppose now that the 100 people in our
sample are classified not only as smokers or non-smokers but also as
males or females. The sample may now be characterized by a quadruple
$(M_s, F_s, M_n, F_n)$ of integers giving in order the number of male
and female smokers, male and female non-smokers. We can take for
sample space the aggregate of all quadruples of integers lying between
0 and 100 and adding up to 100. Stating that “relatively
more males than females smoke” means that in our sample the ratio
$M_s/M_n$ is greater than $F_s/F_n$. The point (75, 0, 8, 17) has this property,
but (0, 1, 50, 49) has not. Our event can be described in principle
by enumerating all quadruples with the desired property.
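
The enumeration “in principle” is easy to carry out by machine. The sketch below is only an illustration under stated assumptions: the function name is ours, and the two ratios are compared by cross-multiplication, which is our convention for quadruples in which a group has no non-smokers (the text does not discuss that case).

```python
N = 100  # sample size, as in the example

def relatively_more_males_smoke(ms, fs, mn, fn):
    """Test M_s/M_n > F_s/F_n by cross-multiplication.

    Assumption of this sketch: cross-multiplying avoids division by zero
    when M_n or F_n vanishes; the text leaves that case open.
    """
    return ms * fn > fs * mn

# Sample space: all quadruples of non-negative integers adding up to N.
sample_space = [
    (ms, fs, mn, N - ms - fs - mn)
    for ms in range(N + 1)
    for fs in range(N + 1 - ms)
    for mn in range(N + 1 - ms - fs)
]

# The event is the aggregate of the sample points with the property.
event = {q for q in sample_space if relatively_more_males_smoke(*q)}

print(len(sample_space), len(event))
print((0, 1, 50, 49) in event)
```

Under these conventions a quadruple such as (75, 0, 8, 17) belongs to the event, while (0, 1, 50, 49) does not.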

(c) Arrangements of Distinguishable Objects. Consider four objects
a, b, c, d. We conceive of their order as the result of an experiment.
There are 24 possible arrangements or sample points, namely, abcd,
abdc, acbd, acdb, ..., dcab, dcba. The event A, “a at the first place,”
can be described by enumerating the six cases in which it occurs: A is
the aggregate of the six sample points abcd, abdc, acbd, acdb, adbc, adcb.
We shall say that A consists of these six points. Consider now the
three analogous events B, C, D defined by the property that b, c, d
occupy the second, third, and fourth place, respectively. The event B
consists of the six sample points abcd, abdc, cbad, cbda, dbac, dbca.
There are two points common to A and B, namely, abcd and abdc; if
one of these two arrangements was observed, we would say that both
A and B had occurred. The point abcd is common to all four events
A, B, C, D. Finally, consider the event E that “two and no more
letters occupy their alphabetical place.” This event consists of the
six points abdc, adcb, acbd, dbca, cbad, bacd.
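
The counts in this example can be verified by listing all 24 arrangements. The following sketch represents each sample point as a four-letter string and each event as a Python set; the helper `in_place` is a hypothetical name introduced here for illustration.

```python
from itertools import permutations

letters = "abcd"
sample_space = {"".join(p) for p in permutations(letters)}

def in_place(k):
    """Event that the k-th letter of the alphabet occupies the k-th place."""
    return {s for s in sample_space if s[k] == letters[k]}

A, B, C, D = (in_place(k) for k in range(4))

# E: two and no more letters occupy their alphabetical place.
E = {s for s in sample_space
     if sum(x == y for x, y in zip(s, letters)) == 2}

print(len(sample_space), len(A), len(B))
print(sorted(A & B))        # points common to A and B
print(A & B & C & D)        # the point common to all four events
print(sorted(E))
```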

(d) Arrangements of Indistinguishable Objects. In the above example,
the objects may be two red and two black balls, and we may
consider balls of the same color as indistinguishable. Denoting the
two colors by r and b, we are concerned with all possible arrangements
of the symbols r, r, b, b. Our sample space now has only six points,
namely, rrbb, rbrb, rbbr, brrb, brbr, bbrr. As examples of compound
events, let us consider two events R and B defined by the condition
that the symbols r and b, respectively, are not separated. The event R
contains the sample points rrbb, brrb, bbrr, that is to say, R is realized
if, and only if, one of these three arrangements occurs. Similarly, the
event B contains rrbb, rbbr, bbrr. Both R and B occur simultaneously
in the arrangements rrbb and bbrr; the event “both R and B occur”
contains these two points and no more. Finally, the event “either
R or B or both” consists of all points except rbrb and brbr where
neither R nor B occurs.
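
The passage from 24 to six sample points can be reproduced by collecting the arrangements of the symbols r, r, b, b into a set, so that arrangements differing only by an interchange of like symbols coincide. This is a sketch with names of our own choosing.

```python
from itertools import permutations

# A set merges arrangements that differ only in the order of like balls.
sample_space = sorted({"".join(p) for p in permutations("rrbb")})

R = {s for s in sample_space if "rr" in s}   # the two r's are not separated
B = {s for s in sample_space if "bb" in s}   # the two b's are not separated

print(sample_space)
print(sorted(R & B))                         # both R and B occur
print(sorted(set(sample_space) - (R | B)))   # neither R nor B occurs
```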

Note that the question of whether or not actual balls of the same
color are distinguishable is not an object of our theory: so far as the
mathematics is concerned, this is a matter of definition or agreement.
In practice, the same experiment may be described on either assumption,
depending on the purpose of the theory. For example, in certain
games the face values of playing cards may be irrelevant; in such cases
we shall agree not to distinguish cards of the same suit. Similarly,
people are actually distinguishable, but if we are interested only in their
grouping according to sex, we shall naturally consider people of the
same sex as indistinguishable.

(e) Coin Tossing. If a coin is tossed three times, the sample space
will consist of eight points which may conveniently be represented by
HHH, HHT, HTH, THH, HTT, THT, TTH, TTT. The event A,
“two or more heads,” can be described as the aggregate of the first
four points. The event B, “just one tail,” means either HHT, or HTH,
or THH; we say that B contains these three points.
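
A short enumeration confirms the two events of this example. In the sketch below the eight points are generated in the order produced by `itertools.product`, which differs from the order used in the text, so that A is selected by its defining property rather than by position.

```python
from itertools import product

sample_space = ["".join(t) for t in product("HT", repeat=3)]

A = [s for s in sample_space if s.count("H") >= 2]   # two or more heads
B = [s for s in sample_space if s.count("T") == 1]   # just one tail

print(len(sample_space), A, B)
```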

(f) Ages of a Couple. An insurance company is interested in the
age distribution of couples. Let x stand for the age of the husband,
y for the age of the wife. Each observation results in a number-pair
(x, y). For the sample space corresponding to a single observation
we take the first quadrant of the x,y-plane so that each point x > 0,
y > 0 is a sample point. The event A, “husband is older than 40,”
is represented by all points to the right of the line x = 40; the event
B, “husband is older than wife,” is represented by the angular region
between the x-axis and the bisector y = x, that is to say, by the aggregate
of points with x > y; the event C, “wife is older than 40,” is
represented by the portion of the first quadrant above the line y = 40.
For a geometric representation of the joint age distributions of two
couples we would require a four-dimensional space.

(g) Phase Space. In statistical mechanics, each possible “state”
of a system is called a “point in phase space.” This is only a difference
in terminology. The phase space is simply our sample space; its points
are our sample points.

3. The Sample Space. Events 

It should be clear from the preceding that we shall never speak of
probabilities except in relation to a given sample space (or physically:
in relation to a certain conceptual experiment). We start with the
notion of sample space and its points; from now on they will be considered
given. They are the primitive and undefined notions of the theory precisely
as the notions of “points” and “straight line” remain undefined in an
axiomatic treatment of Euclidean geometry. The nature of the sample
points does not enter our theory. The sample space provides a model
of an ideal experiment in the sense that, by definition, every thinkable
outcome of the experiment is completely described by one, and only one,
sample point. It is meaningful to talk about an event A only when it
is clear for every outcome of the experiment whether the event A has
or has not occurred. The collection of all those sample points which
represent outcomes where A has occurred completely describes the
event. Conversely, any given aggregate A containing one or more
sample points can be spoken of as an event; it does, or does not, occur
according as the outcome of the experiment is, or is not, represented by
a point of the aggregate. We therefore define the word event
to mean the same as an aggregate of sample points. We shall say that
an event A consists of (or contains) certain points, namely,
those representing outcomes of the ideal experiment in which A occurs.

The terms “event” and “sample point” have an intuitive appeal, but
these notions are equivalent to point sets and points in
all parts of mathematics.



Figure 1. Unions and Intersections of Events. The domain within heavy
boundaries is the union $A \cup B \cup C$. The triangular (heavily shaded)
domain is the intersection $ABC$. The moon-shaped (lightly shaded)
domain is the intersection of B with the complement of $A \cup C$.


4. Relations among Events

We shall now suppose that an arbitrary, but fixed, sample space
$\mathfrak{S}$ is given. To every event A there corresponds another event
defined by the condition “A does not occur.” It contains all points
not contained in A.

Definition 1. The event consisting of all points not contained in the
event A will be called the complementary event (or negation) of A and
will be denoted by $\bar{A}$.

It will occasionally occur that an event A is defined by certain
conditions but that a closer analysis reveals that A is impossible. To
have a convenient notation for this case, we introduce

Definition 2. We shall use the notation

(4.1) $A = 0$

to express that the event A contains no sample points (is impossible).
The zero in (4.1) must be interpreted in a symbolic sense and not as
the number.

Figure 2. Intersections and Differences of Events.

With any two events A and B we can associate two new events
defined by the conditions “both A and B occur” and “either A or B
or both occur.” These events will be denoted by $AB$ and $A \cup B$,
respectively. The event $AB$ contains all sample points which are
common to A and B. If A and B exclude each other, then there are
no points common to A and B and the event $AB$ is impossible;
analytically, this situation is described by the equation

(4.2) $AB = 0$

which should be read “A and B are mutually exclusive.” The event
$A\bar{B}$ means that both A and $\bar{B}$ occur or, in other words, that A but not
B occurs. Similarly, $\bar{A}\bar{B}$ means that neither A nor B occurs. The
event $A \cup B$ means that at least one of the events A and B occurs;
it contains all sample points except those which belong neither to
A nor to B.

Examples. (a) In the example (2.c), the event $AB$ is defined by the
conditions that a occupies the first and b the second place; thus, $AB$
contains the two points abcd and abdc. The event $A \cup B$ contains
the ten points abcd, abdc, acbd, acdb, adbc, adcb, cbad, cbda, dbac, dbca.
The event $A\bar{B}$ consists of acbd, acdb, adbc, adcb.

(b) In the example (2.d), the event $R \cup B$ contains all points except
rbrb and brbr, while $RB$ consists of rrbb and bbrr. The event $R\bar{B}$
contains only the point brrb.

(c) In the example (2.f), the event $AB$ means that the husband is
older than 40 and older than his wife, while $A\bar{B}$ means that he is older
than 40 but not older than his wife. Then $AB$ is represented by the
infinite trapezoidal region between the x-axis and the lines x = 40
and y = x. The event $A\bar{B}$ is represented by the angular domain
between the lines x = 40 and y = x, the latter boundary included.
The event $AC$ means that both husband and wife are older than 40.
The event $A \cup C$ means that at least one of them is older than 40,
while $A \cup B$ means that the husband is either older than 40 or, if
not that, at least older than his wife (in official language “husband’s
age exceeds 40 years or wife’s age, whichever is smaller”).
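
The relations used in examples (a) and (b) correspond directly to set operations, and may be checked mechanically. The sketch below uses the sample space of example (2.c); `complement` is a helper of our own, and the complementary event of B is written `complement(B)`.

```python
from itertools import permutations

S = {"".join(p) for p in permutations("abcd")}  # sample space of example (2.c)
A = {s for s in S if s[0] == "a"}               # a at the first place
B = {s for s in S if s[1] == "b"}               # b at the second place

def complement(event, space=S):
    """Complementary event: all points of the space not contained in the event."""
    return space - event

both = A & B                                 # A and B occur simultaneously
at_least_one = A | B                         # union of A and B
A_not_B = A & complement(B)                  # A but not B occurs
neither = complement(A) & complement(B)      # neither A nor B occurs

print(sorted(both), len(at_least_one), sorted(A_not_B), len(neither))
```

The last event coincides with `complement(A | B)`, in agreement with the statement that the union contains all points except those belonging neither to A nor to B.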

In the theory of probability we can describe the event $AB$ in words
as the simultaneous occurrence of A and B. In standard mathematical
terminology $AB$ is called the (logical) intersection of A and B. Similarly,
$A \cup B$ is the union of A and B. Our notion carries over to the
case of events A, B, C, D, .... We can still define two new events
$ABCD \cdots$ and $A \cup B \cup C \cup D \cup \cdots$ consisting, respectively, in
the simultaneous realization of all events A, B, C, D, ... and in the
realization of at least one among the events A, B, C, D, ....



Definition 3. To every collection A, B, C, ... of events we define two
new events as follows. The aggregate of the sample points which belong
to all the given sets will be denoted by $ABC \cdots$ and called the intersection
(or simultaneous realization) of A, B, C, .... The aggregate of sample
points which belong to at least one of the given sets will be denoted by
$A \cup B \cup C \cup \cdots$ and called the union (or realization of at least one) of the
given events. The events A, B, C, ... are mutually exclusive if no two
have a point in common, that is, if $AB = 0$, $AC = 0$, ..., $BC = 0$, ....

Example. (d) Bridge (cf. the footnote in section 1). Let A, B, C, D be
the events, respectively, that North, South, East, West have at least
one ace. It is clear that at least one player has an ace, so that at
least one of the four events must occur. Hence $A \cup B \cup C \cup D = \mathfrak{S}$
is the whole sample space. The event $ABCD$ occurs if, and only if,
each player has an ace. The event “West has all four aces” means
that none of the three events A, B, C has occurred; this is the same
as the simultaneous occurrence of $\bar{A}$ and $\bar{B}$ and $\bar{C}$ or the event $\bar{A}\bar{B}\bar{C}$.


We still require a symbol to express the statement that A cannot 
occur without B occurring or that the occurrence of A implies the 
occurrence of B. This means that every point of A is contained in B. 
One should think of intuitive analogies like the aggregate of all mothers, 
which forms a part of the aggregate of all women: all mothers are 
women but not all women are mothers. 

Definition 4. If every point of A is contained in B, we shall write
$A \subset B$ and say that A implies B. Alternatively, we shall also write $B \supset A$
and say that B is implied by A. In that case, we shall also write $B - A$
instead of $B\bar{A}$ to denote the event that B but not A occurs.

The event $B - A$ contains all those points which are in B but not
in A. With this notation we can write $\bar{A} = \mathfrak{S} - A$ and $A - A = 0$.

Examples. (e) If A and B are mutually exclusive, then the occurrence
of A implies the non-occurrence of B and vice versa. Thus
$AB = 0$ means the same as $A \subset \bar{B}$ and as $B \subset \bar{A}$.

(f) The event $A - AB$ means the occurrence of A but not of both
A and B. Thus $A - AB = A\bar{B}$.

(g) In the example (2.c), if a, b, and c are in their natural places, so
is d. Hence $ABC \subset D$. The event $D - ABC$ means that d occupies
the fourth place, but not all three remaining letters are in their natural
places. This event consists of the five arrangements acbd, cbad, bacd,
bcad, cabd.




(h) In the example (2.f) we have $BC \subset A$; in words, “if husband
is older than wife (B) and wife is older than 40 (C), then husband is
older than 40 (A).” How can the event $A - BC$ be described in words?
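
Implication and difference of events are again plain set relations. The following sketch checks the statements of example (g) on the 24 arrangements of a, b, c, d; the subset test `<=` plays the role of “implies” and the set difference `-` that of $B - A$. The variable names are ours.

```python
from itertools import permutations

S = {"".join(p) for p in permutations("abcd")}

# A, B, C, D: the letters a, b, c, d occupy their natural places.
A, B, C, D = ({s for s in S if s[k] == "abcd"[k]} for k in range(4))

ABC = A & B & C

print(ABC <= D)                        # every point of ABC is contained in D
print(sorted(D - ABC))                 # d in its place, but not all of a, b, c
print(A - (A & B) == A & (S - B))      # the relation of example (f)
```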

5. Discrete Sample Spaces 

The simplest sample spaces are those which contain only a finite
number, n, of points. If n is fairly small (as in the case of tossing a
few coins), it is easy to visualize the space. The space of distributions
of cards in bridge is more complicated. However, one may imagine
each sample point represented on a chip and may then consider the
collection of these chips as representing the sample space. An event A
(like “North has two aces”) is represented by a certain set of chips;
$\bar{A}$, by the remaining ones. It takes only one step from here to imagine
a bowl with infinitely many chips or a sample space with an infinite
sequence of points $E_1, E_2, E_3, \ldots$.

Examples. (a) Let us agree to toss a coin as often as necessary to
turn up one head. The points of the sample space are then $E_1 = H$,
$E_2 = TH$, $E_3 = TTH$, $E_4 = TTTH$, etc. We may or may not consider
as thinkable the possibility that H never appears. If we do, this
possibility should be represented by a point $E_0$.

(b) Craps. This game requires the player to throw 2 dice until a
decision has been reached under the following condition. He wins if
the first throw results in the sums 7 or 11 or, alternatively, if the first
sum is 4, 5, 6, 8, 9, or 10, and the same sum reappears before the 7
has appeared. This implies that the player loses if the first throw
results in the sum 2, 3, or 12 or, alternatively, if the first sum is
4, 5, 6, 8, 9, or 10, and the sum 7 appears before the first sum reappears.
There are clearly infinitely many possible results, since it is conceivable
that the game will not end in n trials, however large n may be.
Nevertheless, it is easy to enumerate all sample points according to the
number of throws required. One throw decides the game if the sum
is 2, 3, 7, 11, or 12. Two throws decide if they result in (4, 4), (4, 7),
(5, 5), (5, 7), ..., (10, 10), (10, 7), etc. Thus a systematic enumeration
of all sample points is tedious, but possible. The rules of the game
would be senseless if we were not convinced that the game will be
decided sooner or later so that we need not worry about the possibility
of an unending game.
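
The systematic enumeration according to the number of throws can be sketched as follows. As in the enumeration above, each sample point is recorded as the sequence of sums thrown (an assumption of this sketch; one could equally record the individual faces). The function name `decided_in` is hypothetical.

```python
from itertools import product

SUMS = range(2, 13)                   # possible sums of two dice
DECIDED_AT_ONCE = {2, 3, 7, 11, 12}   # first throw wins (7, 11) or loses (2, 3, 12)

def decided_in(n):
    """Sample points, as sequences of sums, for games decided at throw n exactly."""
    if n == 1:
        return [(s,) for s in sorted(DECIDED_AT_ONCE)]
    points = []
    for first in SUMS:
        if first in DECIDED_AT_ONCE:
            continue
        # Intermediate throws leave the game undecided: neither `first` nor 7.
        undecided = [s for s in SUMS if s not in (first, 7)]
        for middle in product(undecided, repeat=n - 2):
            for last in (first, 7):   # first sum reappears (win) or 7 appears (loss)
                points.append((first, *middle, last))
    return points

print(decided_in(2))
print([len(decided_in(n)) for n in range(1, 5)])
```

Every n yields finitely many points, so the whole sample space can be arranged in a single sequence, although no bound on n exists.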

Definition. A sample space is called discrete if it contains only finitely
many points or infinitely many points which can be arranged into a simple
sequence $E_1, E_2, \ldots$.



1 . 6 ] 


17 


PROBABILITIES IN DISCRETE SAMPLE SPACES 

Not every sample space is discrete. It is a known theorem (due to 
G. Cantor) that the sample space consisting of all positive numbers 
is not discrete. We are here confronted with a distinction familiar 
in mechanics, where one usually first considers discrete mass points, 
with each individual point carrying a finite mass. This concept contrasts
with continuous mass distribution, where each individual point
has zero mass. In the first case, the mass of a system is obtained 
simply by adding the masses of the individual points; in the second 
case, masses are computed by integration over mass densities. Quite 
similarly, the probabilities of events in discrete sample spaces are 
obtained by mere additions, whereas in other spaces integrations are 
necessary. Except for the technical tools required, there is no essential 
difference between the two cases. In order to present actual probability 
considerations unhampered by technical difficulties, we shall first take 
up only discrete sample spaces. It will be seen that even this special 
case leads to many interesting and important results. 

In this volume we shall consider only discrete sample spaces. 

6. Probabilities in Discrete Sample Spaces 

The probabilities of the various events are numbers of the same
nature as distances in geometry or masses in mechanics. The theory
assumes that they are given but need assume nothing as to their actual
numerical value or how they are measured in practice. Some of the
most important applications are of a qualitative nature and independent
of numerical values; the general conclusions of the theory are
applied in many ways exactly as the theorems of geometry serve as a
basis for physical theories and engineering applications. In the relatively
few cases where numerical values for probabilities are required
the methods of procedure vary as widely as do the methods of determining
distances. There is little in common in the practices of the
carpenter, the practical surveyor, the pilot, and the astronomer, when
they measure distances. In our context, we may consider the diffusion
constant, which is a notion of the theory of probability. To find its
numerical value physical considerations relating it to other theories
are required; a direct measurement is impossible. By contrast, mortality
tables are constructed from rather crude observations. In most
actual applications the determination of probabilities, or the comparison
of theory and observation, requires rather sophisticated statistical
methods which in turn are based on a refined probability theory. In
other words, the intuitive meaning of probability is clear, but only
as the theory proceeds shall we be able to see how it is applied. All
possible “definitions” of probability fall far short of the actual
practice.

When tossing a “good” coin we do not hesitate to associate probability
1/2 with either head or tail. This amounts to saying that when
a coin is tossed n times all $2^n$ possible results have the same probability.
From a theoretical standpoint, this is a convention. Frequently, it
has been contended that this convention is logically unavoidable and
the only possible one. Yet there have been philosophers and statisticians
defying the convention and starting from contradictory assumptions
(uniformity or non-uniformity in nature). It has also been
claimed that the probabilities 1/2 are due to experience. As a matter
of fact, whenever refined statistical methods have been used to check
on actual coin tossing, the result has been invariably that head and
tail are not equally likely. And yet we stick to our model of an “ideal”
coin, even though no good coins exist. We preserve the model not
merely for its logical simplicity, but essentially for its usefulness and
applicability. In many applications it is sufficiently accurate to
describe reality. More important is the empirical fact that departures
from our scheme are always coupled with phenomena such as an
eccentric position of the center of gravity. In this way our idealized
model can be extremely useful even if it never applies exactly. For
example, in modern statistical quality control based on Shewhart’s
methods, idealized probability models are used to discover “assignable
causes” for flagrant departures from these models and thus to remove
impending machine troubles and process irregularities at an early stage.

Similar remarks apply to other cases. The number of possible distributions
of cards in bridge is almost $10^{30}$. Usually, we agree to
consider them as equally probable. For a check of this convention
more than $10^{30}$ experiments would be required (a billion billion
years if every living person played one game every second, day and
night). However, consequences of the assumption can be verified
experimentally, for example, by observing the frequency of multiple
aces in the hands at bridge. It turns out that for crude purposes the
idealized model describes experience sufficiently well, provided that the
card shuffling is done better than is usual. It is more important that
the idealized scheme, when it does not apply, permits the discovery of
“assignable causes” for the discrepancies, for example, the reconstruction
of the mode of shuffling. These are examples of limited importance,
but they indicate the usefulness of assumed models. More
interesting cases will appear only as the theory proceeds.

Fundamental Convention. Given a discrete sample space $\mathfrak{S}$ with the
sample points $E_1, E_2, \ldots$, we shall assume that with each point $E_j$ there
is associated a number, called the probability of $E_j$ and denoted by $\Pr\{E_j\}$.
It is to be non-negative and such that

(6.1) $\Pr\{E_1\} + \Pr\{E_2\} + \cdots = 1$.

Note that we do not exclude the possibility that a point has probability
zero. This convention may appear artificial but is necessary
to avoid complications. In discrete sample spaces probability zero is 
in practice interpreted as an impossibility, and any sample point 
known to have probability zero can, with impunity, be eliminated from 
the sample space. However, frequently the numerical values of the 
probabilities are not known in advance, and involved considerations 
are required to decide whether or not a certain sample point has 
positive probability. 

Definition. The probability $\Pr\{A\}$ of any event A is the sum of the
probabilities of all sample points in it.

The fundamental equation (6.1) states that the probability of the
entire sample space $\mathfrak{S}$ is unity, or $\Pr\{\mathfrak{S}\} = 1$. It follows that for
any event A

(6.2) $0 \le \Pr\{A\} \le 1$.

Consider now two arbitrary events $A_1$ and $A_2$. To compute the
probability $\Pr\{A_1 \cup A_2\}$ that either $A_1$ or $A_2$ or both occur, we have
to add the probabilities of all sample points contained either in $A_1$
or in $A_2$, but each point is to be counted only once. We have, therefore,

(6.3) $\Pr\{A_1 \cup A_2\} \le \Pr\{A_1\} + \Pr\{A_2\}$.

Now, if E is any point contained both in $A_1$ and in $A_2$, then $\Pr\{E\}$
occurs twice in the right-hand member but only once in the left-hand
member. Therefore, the right side exceeds the left side by the amount
$\Pr\{A_1 A_2\}$, and we have the simple but important

Theorem. For any two events $A_1$ and $A_2$ the probability that either
$A_1$ or $A_2$ or both occur is given by

(6.4) $\Pr\{A_1 \cup A_2\} = \Pr\{A_1\} + \Pr\{A_2\} - \Pr\{A_1 A_2\}$.

If $A_1 A_2 = 0$, that is, if $A_1$ and $A_2$ are mutually exclusive, then (6.4)
reduces to

(6.5) $\Pr\{A_1 \cup A_2\} = \Pr\{A_1\} + \Pr\{A_2\}$.





Example. A coin is tossed twice. For sample space we take the
four points HH, HT, TH, TT, and associate with each probability 1/4.
Let $A_1$ and $A_2$ be, respectively, the events “head at first and second
trial.” Then $A_1$ consists of HH and HT, and $A_2$ of TH and HH.
Furthermore $A = A_1 \cup A_2$ contains the three points HH, HT, and
TH, whereas $A_1 A_2$ consists of the single point HH. Thus

$\Pr\{A_1 \cup A_2\} = \frac{1}{2} + \frac{1}{2} - \frac{1}{4} = \frac{3}{4}$.
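
The definition of the probability of an event as a sum over its sample points, and the addition rule for two events, can be checked on this example with exact fractions. The sketch below is illustrative only; the dictionary `prob` and the function `pr` are names of our own.

```python
from fractions import Fraction
from itertools import product

# Sample space for two tosses; each point carries probability 1/4.
prob = {"".join(t): Fraction(1, 4) for t in product("HT", repeat=2)}

def pr(event):
    """Probability of an event: the sum of the probabilities of its points."""
    return sum((prob[point] for point in event), Fraction(0))

A1 = {s for s in prob if s[0] == "H"}   # head at first trial
A2 = {s for s in prob if s[1] == "H"}   # head at second trial

lhs = pr(A1 | A2)                       # direct sum over the union
rhs = pr(A1) + pr(A2) - pr(A1 & A2)     # addition rule for two events

print(pr(prob.keys()), lhs, rhs)
```

Summing over the union counts the common point HH once, whereas the sum of the two separate probabilities counts it twice; subtracting the probability of the intersection restores equality.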

The probability $\Pr\{A_1 \cup A_2 \cup \cdots \cup A_n\}$ of the realization of at
least one among n events can be computed by a formula analogous to
(6.4); this will be taken up in Chapter 4, section 1. Here we note only
that the inequality (6.3) obviously holds in general. Thus for arbitrary
events $A_1, A_2, \ldots$ the inequality

(6.6) $\Pr\{A_1 \cup A_2 \cup \cdots\} \le \Pr\{A_1\} + \Pr\{A_2\} + \cdots$

holds. In the special case where the events $A_1, A_2, \ldots$ are mutually
exclusive, we have

(6.7) $\Pr\{A_1 \cup A_2 \cup \cdots\} = \Pr\{A_1\} + \Pr\{A_2\} + \cdots$.

Occasionally (6.6) is referred to as Boole’s inequality.
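
A small numerical illustration of Boole’s inequality and of the additivity for mutually exclusive events may be helpful. The example is ours, not the text’s: one throw of a true die, with three overlapping events and three mutually exclusive ones chosen arbitrarily.

```python
from fractions import Fraction

# One throw of a true die: six sample points, each with probability 1/6.
prob = {face: Fraction(1, 6) for face in range(1, 7)}

def pr(event):
    """Probability of an event: the sum of the probabilities of its points."""
    return sum((prob[x] for x in event), Fraction(0))

overlapping = [{2, 4, 6}, {4, 5, 6}, {1}]   # some points belong to two events
exclusive = [{1, 2}, {3}, {5, 6}]           # no two events share a point

for events in (overlapping, exclusive):
    union = set().union(*events)
    print(pr(union), sum(pr(e) for e in events))
```

For the overlapping events the sum of the probabilities exceeds the probability of the union; for the mutually exclusive events the two agree.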

We shall first investigate the simple special case where the sample 
space has a finite number, N, of points each having probability 1/N.
In this case, the probability of any event A equals the number of 
points in A divided by N. In the older literature, the points of the 
sample space were called “cases” and the points of A, “favorable” 
cases (favorable for A). If all points have the same probability, then 
the probability of an event A is the ratio of the number of favorable 
cases to the total number of cases. Unfortunately, this statement has 
been much abused to provide a “definition” of probability. It is 
often contended that in every finite sample space probabilities of all 
points are equal. This is not so. For a single throw of an untrue coin, 
the sample space still contains only the two points, head and tail, but 
they may have arbitrary probabilities p and q, with p + q = 1. A 
newborn baby is a boy or girl, but in applications we have to admit 
that the two possibilities are not equally likely. In many applications 
in physics and technics, we have simple alternatives with different 
probabilities. The usefulness of sample spaces in which all sample
points have the same probability is restricted almost entirely to the
study of games of chance and to combinatorial analysis.

7. Problems for Solution 

1. Among the digits 1, 2, 3, 4, 5 first one is chosen, and then a second selection
is made among the remaining four digits. Assume that all twenty possible results
have the same probability. Find the probability that an odd digit will be selected
(a) the first time, (b) the second time, (c) both times.

2. A coin is tossed until for the first time the same result appears twice in
succession. To every possible outcome requiring n tosses attribute probability $1/2^n$.
Describe the sample space. Find the probability of the following events: (a) the
experiment ends before the sixth toss, (b) an even number of tosses is required.

3. Two dice are thrown. Let A be the event that the sum of the faces is odd,
B the event of at least one ace. Describe the events $AB$, $A \cup B$, $A\bar{B}$. Find their
probabilities, assuming that all 36 sample points have equal probabilities.

4. In the example (2.f), discuss the meaning of the following events: (a) $ABC$,
(b) $A - AB$, (c) $A\bar{B}C$.

5. In the example (2.f), verify that $A\bar{C} \subset B$.

6. Bridge (cf. footnote on p. 9). For $k = 1, 2, 3, 4$, let $N_k$ be the event that
North has at least k aces. Let $S_k$, $E_k$, $W_k$ be the analogous events for South, East,
West. What can be said about the number x of aces in West’s possession in the
events (a) $\bar{W}_1$, (b) $N_2S_2$, (c) $\bar{N}_1\bar{S}_1\bar{E}_1$, (d) $W_2 - W_3$, (e) $N_1S_1E_1W_1$,
(f) $N_3W_1$, (g) $(N_2 \cup S_2)E_2$.

7. In the preceding problem verify that (a) $S_3 \subset S_2$, (b) $S_3W_2 = 0$,
(c) $N_2S_1E_1W_1 = 0$, (d) $N_2S_2 \subset \bar{W}_1$, (e) $(N_2 \cup S_2)W_3 = 0$,
(f) $W_4 = \bar{N}_1\bar{S}_1\bar{E}_1$.

8. Verify the following relations 2:

(a) $\overline{A \cup B} = \bar{A}\bar{B}$.
(b) $(A \cup B) - B = A - AB = A\bar{B}$.
(c) $AA = A \cup A = A$.
(d) $(A - AB) \cup B = A \cup B$.
(e) $(A \cup B) - AB = A\bar{B} \cup \bar{A}B$.
(f) $\bar{A} \cup \bar{B} = \overline{AB}$.
(g) $(A \cup B)C = AC \cup BC$.

9. Find simple expressions for (a) $(A \cup B)(A \cup \bar{B})$,
(b) $(A \cup B)(\bar{A} \cup B)(A \cup \bar{B})$, (c) $(A \cup B)(B \cup C)$.

10. State which of the following relations are correct and which incorrect:

(a) $(A \cup B) - C = A \cup (B - C)$.
(b) $ABC = AB(C \cup B)$.
(c) $A \cup B \cup C = A \cup (B - AB) \cup (C - AC)$.
(d) $A \cup B = (A - AB) \cup B$.
(e) $AB \cup BC \cup CA \supset ABC$.
(f) $(AB \cup BC \cup CA) \subset (A \cup B \cup C)$.
(g) $(A \cup B) - A = B$.
(h) $A\bar{B}C \subset A \cup B$.
(i) $\overline{A \cup B \cup C} = \bar{A}\bar{B}\bar{C}$.
(j) $\overline{(A \cup B)}\,C = \bar{A}C \cup \bar{B}C$.
(k) $\overline{(A \cup B)}\,C = \bar{A}\bar{B}C$.
(l) $\overline{(A \cup B)}\,C = C - C(A \cup B)$.

2 Note that $\overline{A \cup B}$ denotes the complement of $A \cup B$ which is not the same as
$\bar{A} \cup \bar{B}$. Similarly, $\overline{AB}$ is not the same as $\bar{A}\bar{B}$.




11. Let A, B, C be three arbitrary events. Find expressions for the events that
of A, B, C:

(a) Only A occurs.
(b) Both A and B, but not C, occur.
(c) All three events occur.
(d) At least one occurs.
(e) At least two occur.
(f) One and no more occurs.
(g) Two and no more occur.
(h) None occurs.
(i) Not more than two occur.

12. The union $A \cup B$ of two events can be expressed as the union of two mutually
exclusive events, thus: $A \cup B = A \cup (B - AB)$. Express in a similar way the
union of three events A, B, C.



CHAPTER 2 


ELEMENTS OF COMBINATORIAL ANALYSIS 
STIRLING’S FORMULA 

The following two chapters on combinatorial analysis are a necessary
interruption of the main course of the book. We want to derive a few
important formulas in a way appropriate for our purposes. The advanced
reader may pass on directly to Chapter 4, where the thread of
Chapter 1 is taken up again.

In the study of simple games of chance, sampling procedures, occupancy
and order problems, etc., we are usually dealing with finite
sample spaces in which the same probability is attributed to all points.
To compute the probability of an event A we have then only to divide
the number of sample points in A (“favorable cases”) by the total
number of sample points (“possible cases”). This is facilitated by a
systematic use of a few rules which we now proceed to review. Simplicity
and economy of thought can be achieved by adhering to a few
standard tools, and we shall follow this procedure instead of describing
the shortest computational method in each special case. 1

1. Preliminaries 

Pairs. With m elements $a_1, \ldots, a_m$ and n elements $b_1, \ldots, b_n$ it is
possible to form mn pairs $(a_j, b_k)$ containing one element from each group.

Proof. Arrange the pairs in a rectangular array in the form of a
multiplication table with m rows and n columns so that $(a_j, b_k)$ stands
at the intersection of the jth row and kth column. Then each pair
appears once and only once, and the assertion becomes obvious.

Examples. (a) Bridge Cards (cf. footnote to Chapter 1, section 1).
As sets of elements take the four suits and the thirteen face values,
respectively. Each card is defined by its suit and its face value, and
there exist $4 \cdot 13 = 52$ such combinations, or cards.

1 The interested reader will find many topics of elementary combinatorial analysis
treated in the classical textbook, Choice and Chance, by W. A. Whitworth, fifth
edition, London, 1901, reprinted by G. E. Stechert, New York, 1942. The companion
volume by the same author, DCC Exercises, reprinted New York, 1945,
contains 700 problems with complete solutions.

(b) “Seven-way Lamps.” Some floor lamps so advertised contain
3 ordinary bulbs and also an indirect lighting fixture which can be
operated on three levels but need not be used at all. Each of these
four possibilities can be combined with 0, 1, 2, or 3 bulbs. Hence there
are $4 \cdot 4 = 16$ possible combinations of which one, namely (0, 0),
means that no bulb is on. There remain fifteen (not seven) ways of
operating the lamps.

Multiplets. Given $n_1$ elements $a_1, \ldots, a_{n_1}$ and $n_2$ elements $b_1, \ldots, b_{n_2}$,
etc., up to $n_r$ elements $x_1, \ldots, x_{n_r}$; it is possible to form $n_1 \cdot n_2 \cdots n_r$
combinations $(a_{j_1}, b_{j_2}, \ldots, x_{j_r})$ containing one element of each kind.

Proof. If r = 2, the assertion reduces to the first rule. If r = 3,
take the pair $(a_i, b_j)$ as element of a new kind. There are $n_1 n_2$ such
pairs and $n_3$ elements $c_k$. Each triple $(a_i, b_j, c_k)$ is itself a pair consisting
of $(a_i, b_j)$ and an element $c_k$; the number of triplets is therefore $n_1 n_2 n_3$.
Proceeding by induction, the assertion follows for every r.

Examples. (c) Multiple Classifications. Suppose that people are
classified according to sex, marital status, and profession. The various
categories play the role of elements. If there are seventeen professions,
then we have $2 \cdot 2 \cdot 17 = 68$ classes in all.

(d) In an agricultural experiment three different treatments are to
be tested (for example, the application of a fertilizer, a spray, and
temperature). If these treatments can be applied on $r_1$, $r_2$, and $r_3$
levels or concentrations, respectively, then there are $r_1 r_2 r_3$ combinations,
or ways of treatment.
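
The counting rules for pairs and multiplets can be illustrated by listing the combinations explicitly. The sketch below is only an illustration in Python using `itertools.product`; the labels chosen for suits, lamp levels, and categories are assumptions made for the example.

```python
from itertools import product

# Example (a): a card is a (suit, face value) pair.
suits = ["spades", "hearts", "diamonds", "clubs"]
faces = list(range(1, 14))
cards = list(product(suits, faces))
print(len(cards))                      # 4 * 13

# Example (b): (indirect-light level, number of ordinary bulbs lit);
# level 0 means that the indirect fixture is not used.
lamp_states = list(product(range(4), range(4)))
lit = [s for s in lamp_states if s != (0, 0)]
print(len(lamp_states), len(lit))

# Example (c): sex x marital status x profession, with 17 professions.
classes = list(product(range(2), range(2), range(17)))
print(len(classes))
```

Enumerating is feasible only for small sets, which is precisely why the product rule is preferred in the text.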

2. Samples 

Consider the set or “population” of n elements $a_1, a_2, \ldots, a_n$. Any
ordered arrangement of r symbols is called a sample of size r drawn
from our population. It is understood that the sample is ordered,
and we write it in the form $(a_{j_1}, a_{j_2}, \ldots, a_{j_r})$. For an intuitive picture
we can imagine that the elements are selected one by one. Two procedures
are then possible. First, sampling with replacement; here each
selection is made from the entire population, so that the same element
can be drawn more than once. The samples are then arrangements
in which repetitions are permitted. Second, sampling without replacement;
here an element once chosen is removed from the population, so
that the sample becomes an arrangement without repetitions. Obviously,
in this case, the sample size r cannot exceed the population
size n.




In sampling with replacement each of the r elements can be chosen
in n ways: the number of possible samples is therefore $n^r$, as can be
seen from the last theorem with $n_1 = n_2 = \cdots = n$. In sampling without
replacement we have n possible choices for the first element, but
only $n - 1$ for the second, $n - 2$ for the third, etc. Using the same
rule, we see that without replacement the number of samples is
$n(n-1)\cdots(n-r+1)$. Such products appear so often that it is
convenient to introduce the notation 2

(2.1) $(n)_r = n(n-1)\cdots(n-r+1)$.

Clearly $(n)_r = 0$ whenever $r > n$. We have thus the following

Theorem. The number of different possible samples of size r from a
population of n elements is $n^r$ if the sampling is with replacement, and
$(n)_r$ if it is without replacement.

Mr. and Mrs. Smith form a sample of size two drawn from the
human population; at the same time, they form a sample of size one
drawn from the population of all couples. This example shows that
the sample size is defined only in relation to a given population. Tossing
a coin r times is one way of obtaining a sample of size r drawn from
the population of the two letters, H and T. The same arrangement of r
letters H and T is a single sample point in the space corresponding to
the experiment of tossing a coin r times.

Drawing r elements from a population of size n is an experiment
whose possible outcomes are samples of size r. Their number is $n^r$ or
$(n)_r$, depending on whether or not replacement is used. In either case,
our conceptual experiment is described by a sample space in which
each individual point represents a sample of size r.

So far we have not spoken of probabilities associated with our
samples. Usually we shall assign equal probabilities to all of them and
then speak of random samples. The word “random” is not well
defined, but when applied to samples or selections it has a unique
meaning. Whenever we speak of random samples of fixed size r, the
adjective random is to imply that all possible samples have the same
probability, namely, $n^{-r}$ in sampling with replacement and $1/(n)_r$ in
sampling without replacement, n denoting the size of the population
from which the sample is drawn. If n is large and r relatively small,
the ratio $(n)_r/n^r$ is near unity. This suggests that for large populations
and relatively small samples the difference between the two ways of
sampling is negligible [cf. Chapter 6, example (2.d)].
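
The two sample counts of the theorem, and the remark that their ratio is near unity for large populations, can be checked numerically. The following Python sketch is offered under the assumption that `math.perm` (Python 3.8 or later) is available; the function names and the population sizes are chosen only for illustration.

```python
import math
from itertools import permutations, product

def samples_with_replacement(n, r):
    """Number of ordered samples of size r with replacement: n**r."""
    return n ** r

def samples_without_replacement(n, r):
    """Falling factorial (n)_r = n(n-1)...(n-r+1); zero when r > n."""
    return math.perm(n, r)

# Brute-force check on a small population.
population = "abcd"
n, r = len(population), 2
assert len(list(product(population, repeat=r))) == samples_with_replacement(n, r)
assert len(list(permutations(population, r))) == samples_without_replacement(n, r)

# For a large population and a small sample the two counts nearly agree.
ratio = samples_without_replacement(10_000, 5) / samples_with_replacement(10_000, 5)
print(ratio)
```

The brute-force enumeration is there only to confirm the formulas on a case small enough to list by hand.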

We have introduced a practical terminology but have made no
statements as to the applicability of our model of random sampling
to reality. Tossing coins, throwing dice, and similar activities may be
interpreted as experiments in practical random sampling with replacement,
and our probabilities are numerically close to frequencies observed
in long-run experiments even though perfectly balanced coins
or dice do not exist. Random sampling without replacement is typified
by successive drawings of cards from a shuffled deck (provided that
shuffling is done much better than is usual). In sampling human populations
the statistician encounters considerable and often unpredictable
difficulties, and bitter experience has shown that it is difficult to obtain
even a crude image of randomness.

2 The notation $(n)_r$ is not standard, but it will be used consistently in this book.

3. Examples 

(a) Random Sampling Numbers. To compare our model of coin 
tossing with actual experiments, we require the records of a long 
series of trials. We could then compare various probabilities with 
the observed frequencies of corresponding events. Records of the 
desired kind are not readily available, but Kendall and Smith 3 have 
published 100,000 random digits, that is, a record of 100,000 trials 
representing random sampling in the population of the digits 0, 1, 
• • •, 9. At any point of these tables the next r digits should represent 
a random sample as closely as one can expect observations to correspond 
to a theoretical model. 

As an example, consider the event that the digit is 7. The probability 
is 1/10, and we expect that among r digits the 7 will occur approximately
r/10 times. The observed frequencies of the occurrence of 7
in the first 100 batches of 100 digits each (first 100 columns in the 
tables) are given in Table 1. As would be expected, the frequencies 
fluctuate around 0.1. If we take larger samples, the fluctuation of 
the frequencies is smaller. This is borne out by the last column of 
Table 1 which records the average frequencies of the digit 7 for ten 
batches of 1000 digits each. It goes without saying that a more 
advanced theory is required to judge to what extent empirical data 
like those in Table 1 agree with our abstract model. We shall return 
to this material in Chapter 3, and again in Chapter 7, example (3.a), 
when discussing the theory of independent trials and observations. 
There we shall find that in about four out of ten counts of 10,000 ideal
random digits the frequency of the digit 7 must be expected to fall
outside the interval $0.10 \pm 0.0032$. This means that an agreement as
good or better than that in Table 1 can be expected only in about six
out of ten cases.

3 M. G. Kendall and Babington Smith, Tables of Random Sampling Numbers,
Tracts for Computers No. 24, Cambridge, 1940. Older and smaller tables under
the same title by L. H. C. Tippett are in the same series, No. 16.

TABLE 1

Number of Occurrences of the Digit 7 among the First 100 Groups of 100
Digits Each in the Kendall-Smith Tables

                              Hundred                          Average
Thousand     1    2    3    4    5    6    7    8    9   10   Frequency
    1       12    8    5   14   12    9   10    7   10    8     0.095
    2       15    5    8    0   12    8    5   15    3   11      .088
    3        7    8   12   13    7   10   10    8    9   11      .095
    4       14   12    8   10   10    0    9   11   10   10      .112
    5        8   10   10   12   20    5    7    0   11    6      .095
    6        7   15    8    8    9    8    8    7   12   17      .099
    7        8    4    5   10    7    7   11   14    7    9      .082
    8        7    5   12   10   10    8    8   11    0    0      .089
    9       11    7   17    8    9   14   13   14    8   10      .111
   10       12   12   10    9   10    9   11   10   11    8      .102

 1-10,000                                                        .0908

As a second example, we take pairs of random digits and the event 
that both digits of the pair are 7. The probability that a pair is 
(7, 7) is 1/100. Table 2 gives the observed frequencies for 5000 pairs 
(10,000 digits). 

TABLE 2

Number of Occurrences of the Combination (7, 7) among the First 5000
Pairs (10,000 Digits) in the Kendall-Smith Tables

Group                                                        Average Frequency
of 500    1-100   101-200   201-300   301-400   401-500   Total   from Beginning
Pairs
   1        0        1         0         1         0        2        0.004
   2        1        1         0         0         2        4         .006
   3        0        3         1         1         1        6         .008
   4        0        2         2         2         1        7         .0095
   5        1        1         2         0         1        5         .0096
   6        2        1         1         0         1        5         .0097
   7        0        0         0         1         0        1         .0086
   8        0        3         3         1         0        7         .0093
   9        0        3         1         1         3        8         .0100
  10        0        1         1         0         1        3         .0096
  11        2        0         2         0         1        5         .0096
  12        3        2         0         0         0        5         .0097
  13        1        2         0         1         1        5         .0097
  14        0        1         2         1         1        5         .0097
  15        6        0         1         0         0        6         .0099
  16        1        0         2         1         3        7         .0101
  17        1        0         0         0         0        1         .0096
  18        2        1         0         0         0        3         .0094
  19        2        1         0         0         2        5         .0095
  20        0        0         2         0         3        5         .0095

(b) Random Sampling Numbers (Continued). As an application of
the theorem of section 2, we shall calculate the probability p that 5
consecutive random digits are all different. There are $10^5$ possible
arrangements of which $(10)_5 = 10 \cdot 9 \cdot 8 \cdot 7 \cdot 6$ are without repetition.
Hence

(3.1) $p = \frac{(10)_5}{10^5} = 0.3024$.

One expects intuitively that in large mathematical tables having
many decimal places the last five digits will have many properties of
randomness. (In ordinary logarithmic and other smaller tables the
tabular difference is nearly constant and the last digit therefore varies
regularly.) As an experiment, sixteen-place tables 4 were selected and
entries were counted where the last five digits are all different. In
the first twelve batches of 100 entries each, the frequencies of this
event are as follows:

0.30, 0.27, 0.30, 0.34, 0.26, 0.32,
0.37, 0.36, 0.26, 0.31, 0.36, 0.32.

These numbers fluctuate around the value 0.3024, and small-sample
theory shows that the magnitude of the fluctuations is well within the
expected limits. The average frequency is 0.3142, which is rather close
to the theoretical probability, 0.3024 [cf. Chapter 7, example (3.b)].

4 Tables of Probability Functions, vol. I, National Bureau of Standards, 1941.

Consider next the number $e = 2.71828\ldots$. The first 800 decimals 5
form 160 groups of five digits each, which we arrange in sixteen batches
of ten each. In these sixteen batches the numbers of groups in which
all five digits are different are as follows:

3, 1, 3, 4, 4, 1, 4, 4,
4, 2, 3, 1, 5, 4, 6, 3.

The frequencies oscillate, as they should, around the value 0.3024,
and small-sample theory confirms that the magnitude of the fluctuations
is not larger than should be expected. The overall frequency of
our event in the 160 groups is 52/160 = 0.325, which is reasonably
close to p = 0.3024.
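
The theoretical value in (3.1) and the overall frequency quoted for the decimals of e can be recomputed as follows. This is a minimal sketch in Python; the function names are illustrative assumptions, the batch counts are those quoted in the text, and the ten-digit string is only a short hypothetical input used to show how a group would be tested.

```python
from fractions import Fraction
import math

def prob_all_different(n, r):
    """(n)_r / n**r: chance that r draws with replacement from n items show no repeat."""
    return Fraction(math.perm(n, r), n ** r)

p = prob_all_different(10, 5)
print(float(p))                        # theoretical probability (3.1)

# Counts quoted in the text for the decimals of e: groups with five
# different digits in each of sixteen batches of ten groups.
e_counts = [3, 1, 3, 4, 4, 1, 4, 4, 4, 2, 3, 1, 5, 4, 6, 3]
observed = Fraction(sum(e_counts), 10 * len(e_counts))
print(float(observed), float(observed - p))

def all_different(group):
    """True if no digit repeats in the string `group`."""
    return len(set(group)) == len(group)

# Hypothetical usage on the first ten decimals of e, split into groups of five.
decimals = "7182818284"
groups = [decimals[i:i + 5] for i in range(0, len(decimals), 5)]
print([all_different(g) for g in groups])
```

Judging whether the observed deviation is acceptable requires the theory of later chapters; the sketch only reproduces the arithmetic.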

(c) Birthdays. The birthdays of r people form a sample of size r
from the population of all days in the year. The years are not of equal
length, and we know that the birth rates are not quite constant throughout
the year. However, in a first approximation, we may take a random
selection of people as equivalent to a random selection of birthdays.
Furthermore, we shall, for simplicity, ignore the existence of leap years
and shall consider random samples of r birthdays in a year of 365 days.

Let us calculate the probability, p, that all r birthdays are different.
The number of arrangements is $365^r$. If the r birthdays are different,
they form a sample without replacement, and there exist $(365)_r$ such
samples. Hence

(3.2) $p = \frac{(365)_r}{365^r}$.

This formula is not very suggestive. For numerical calculations it is preferable
to write it in the form

(3.3) $p = \left(1 - \frac{1}{365}\right)\left(1 - \frac{2}{365}\right) \cdots \left(1 - \frac{r-1}{365}\right)$.

If r is small, we can neglect all cross products and have in a crude approximation

(3.4) $p \approx 1 - \frac{1 + 2 + \cdots + (r-1)}{365} = 1 - \frac{r(r-1)}{730}$.

For r = 10 the correct value is p = 0.883..., while (3.4) gives the approximation
0.877.

For larger r we obtain a much better approximation by passing to logarithms.
The Taylor expansion [cf. (6.9)] shows that $\log(1 - x) \approx -x$, provided x is small.
Thus from (3.3)

(3.5) $\log p \approx -\frac{1 + 2 + \cdots + (r-1)}{365} = -\frac{r(r-1)}{730}$;

the error term can be calculated from the remainder of the Taylor expansion, and
it turns out to be approximately $r^3/(6 \cdot 365^2)$. For r = 30 the approximation (3.4)
gives a negative value; formula (3.5) gives 0.3037 instead of the correct value
p = 0.294. For r < 40 the error in p is less than 1/12.

5 Intermédiaire des recherches mathématiques, vol. 2, 1946, p. 112.

How large should r be to have $p \approx 1/2$? We have to solve the equation $\log p =
-\log 2$. The above consideration shows that, instead, we may solve $r(r-1)/730
= 0.7$. The root is r = 22.6. With r = 22 people the probability that no two have
the same birthday just exceeds 1/2, but for 23 people this probability is already smaller
than 1/2. [Cf. example (4.c) of Chapter 6.]
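
The exact product (3.3) and the two approximations (3.4) and (3.5) are easy to compare numerically. The sketch below is an illustration in Python, assuming a year of 365 equally likely birthdays as in the text; the function names are hypothetical.

```python
import math

DAYS = 365

def p_exact(r):
    """Formula (3.3): product of (1 - j/365) for j = 1, ..., r-1."""
    return math.prod(1 - j / DAYS for j in range(1, r))

def p_crude(r):
    """Approximation (3.4): 1 - r(r-1)/730."""
    return 1 - r * (r - 1) / (2 * DAYS)

def p_log(r):
    """Approximation (3.5): exp(-r(r-1)/730)."""
    return math.exp(-r * (r - 1) / (2 * DAYS))

for r in (10, 22, 23, 30):
    print(r, round(p_exact(r), 4), round(p_crude(r), 4), round(p_log(r), 4))

# Smallest r for which the probability of all-different birthdays drops below 1/2.
r_half = next(r for r in range(1, DAYS + 1) if p_exact(r) < 0.5)
print(r_half)
```

The search over r confirms the statement about 22 and 23 people without relying on either approximation.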

(d) Elevator. An elevator starts with r = 7 passengers and stops at
n = 10 floors. What is the probability p that no two passengers leave
at the same floor? To render the question precise, we assume that all
arrangements of discharging the passengers have the same probability
(which is presumably only a crude approximation). There exist $10^7$
arrangements, of which $(10)_7$ are arrangements without repetition.
Therefore $p = 10^{-7}(10)_7 = (10 \cdot 9 \cdot 8 \cdot 7 \cdot 6 \cdot 5 \cdot 4)\,10^{-7} = 0.06048$. When
the event was once observed, the occurrence was deemed remarkable
and odds of 1000 to 1 were offered against a repetition.

4. Partitions 

If the sample size equals the population size n, then a sample without
repetitions is the same as an ordering of the n elements of the population.
By the theorem of section 2, the number of such orderings is
$(n)_n = n(n-1) \cdots 3 \cdot 2 \cdot 1$. Instead of $(n)_n$ we use the usual
notation

(4.1) $n! = 1 \cdot 2 \cdot 3 \cdots (n-1) \cdot n$.

We found

Theorem 1. The number of different orderings of n elements is n!.

Let $r \le n$ and consider the samples without repetition of size r
from a population of n elements. There are $(n)_r$ such samples. The
elements in each can be ordered in r! ways, which means that there
exist r! samples having the same elements. If the order is disregarded,
these r! samples become indistinguishable, and the number of distinguishable
arrangements becomes $(n)_r/r!$. This number is known as
the binomial coefficient

(4.2) $\binom{n}{r} = \frac{(n)_r}{r!} = \frac{n(n-1)\cdots(n-r+1)}{1 \cdot 2 \cdots (r-1) \cdot r}$.


Theorem 2. Out of n elements a group of size $r \le n$ can be selected
in $\binom{n}{r}$ different ways (two groups are different if one contains an element
not contained in the other).

An alternative way of writing (4.2) is

(4.3) $\binom{n}{r} = \frac{n!}{r!\,(n-r)!}$.

This formula shows that

(4.4) $\binom{n}{r} = \binom{n}{n-r}$.

Now $\binom{n}{n} = 1$ while $\binom{n}{0}$ is meaningless. Formula (4.4) suggests
defining

(4.5) $\binom{n}{0} = 1$.

This amounts to the same thing as defining 0! = 1.


Examples. (a) Bridge (cf. footnote on p. 9, Chapter 1). Let a
hand of 13 cards be selected at random from a full deck. The order
within a hand is irrelevant. The number of different hands is, therefore,
$\binom{52}{13} = 635{,}013{,}559{,}600$. Let us now calculate the probability p
that a hand contains all thirteen face values, assuming, of course, that all
hands have the same probability. For each face value we have the free
choice of a suit. The hand may be considered as a sample with repetitions
of size 13 of the population of four suits. By the theorem of
section 2 there exist $4^{13}$ such samples, and hence

$p = 4^{13} \Big/ \binom{52}{13} = 0.0001057\ldots$.

(b) Poker (cf. footnote on p. 9, Chapter 1). The number of different
hands at poker is $\binom{52}{5} = 2{,}598{,}960$. To find the number of hands
containing five different face values we note that these face values
can be chosen in $\binom{13}{5}$ different ways. As in the preceding example
we see that the corresponding suits can be selected in $4^5$ ways. Hence,
if the hand is chosen at random, the probability that the 5 cards have
five different face values is

$p = 4^5 \binom{13}{5} \Big/ \binom{52}{5} = 0.5071\ldots$.


(c) Each of the 48 states has two senators. We consider the events
that in a committee of 48 senators chosen at random: (1) a given state
is represented, (2) all states are represented.

In the first case it is better to calculate the probability q of the complementary
event, namely, that the given state is not represented.
There are 96 senators, and 94 not from the given state. Hence,

$q = \binom{94}{48} \Big/ \binom{96}{48} = \frac{48 \cdot 47}{96 \cdot 95}$,

and the probability that the given state is represented is $1 - q
= 0.75263\ldots$. Now the theorem of section 2 shows that a committee
including one senator from each state can be chosen in $2^{48}$ different
ways. The probability that all states are included in the committee
is, therefore, $p = 2^{48} \Big/ \binom{96}{48}$. Using Stirling’s formula (cf. section 7),
it can be shown that $p \approx 4 \cdot 10^{-14}$.
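
The numerical values in examples (a), (b), and (c) can be reproduced with exact binomial coefficients. The following is a sketch in Python, assuming `math.comb` (Python 3.8 or later); the variable names are illustrative only.

```python
import math

# (a) Bridge: hands of 13 cards and hands holding all thirteen face values.
bridge_hands = math.comb(52, 13)
p_all_values = 4 ** 13 / bridge_hands

# (b) Poker: hands of 5 cards and hands with five different face values.
poker_hands = math.comb(52, 5)
p_five_values = math.comb(13, 5) * 4 ** 5 / poker_hands

# (c) Senators: committee of 48 chosen out of 96.
q = math.comb(94, 48) / math.comb(96, 48)      # given state not represented
p_all_states = 2 ** 48 / math.comb(96, 48)     # every state represented

print(bridge_hands, p_all_values)
print(poker_hands, p_five_values)
print(q, 1 - q, p_all_states)
```

Because Python integers are unbounded, the last probability is obtained directly, without the Stirling approximation mentioned in the text.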


Theorem 3. Let $r_1, r_2, \ldots, r_k$ be non-negative integers such that

(4.6) $r_1 + r_2 + \cdots + r_k = n$.

The number of ways in which n objects can be divided into k groups of
which the first contains $r_1$ objects, the second $r_2$ objects, etc., is

(4.7) $\frac{n!}{r_1!\, r_2! \cdots r_k!}$.

(Here, the order of the groups is essential, but no attention is paid to
the order within the groups. The numbers (4.7) are called multinomial
coefficients.)

Proof. Consider first the case k = 2. To partition n objects into
two groups is the same as to select $r_1$ objects to go into the first group;
the $r_2 = n - r_1$ remaining objects form the second group. For k = 2
the number of partitionings is therefore $\binom{n}{r_1} = n!/(r_1!\,r_2!)$. To perform
the partitioning in the general case we start by selecting the first
group of size $r_1$; of the remaining $(n - r_1)$ objects we select a group of
size $r_2$, etc. After forming the $(k-1)$st group there remain
$n - r_1 - r_2 - \cdots - r_{k-1} = r_k$ objects, which form the last group.
We conclude that the number of partitions is

(4.8) $\binom{n}{r_1}\binom{n - r_1}{r_2}\binom{n - r_1 - r_2}{r_3} \cdots \binom{n - r_1 - \cdots - r_{k-2}}{r_{k-1}}$,

which reduces to (4.7) if all binomial coefficients are expressed in accordance
with (4.3).

Examples. (d) Bridge. The n = 52 cards are distributed among
players with $r_1 = r_2 = r_3 = r_4 = 13$. Hence the number of different
situations at a bridge table is:

(4.9) $\frac{52!}{(13!)^4} = (5.3645\ldots)\,10^{28}$.

Let us now calculate the probability that each player has an ace. The
four aces can be ordered in 4! = 24 ways, and each order represents
one possibility of giving one ace to each player. The remaining 48
cards can be distributed in $(48!)(12!)^{-4}$ ways. Hence the required
probability is

$24 \cdot \frac{48!\,(13)^4}{52!} = 0.105\ldots$.

(e) Dice. A throw of 12 dice can result in $6^{12}$ different outcomes
to all of which we attribute equal probabilities. The event that each
face appears twice can occur in as many ways as 12 dice can be arranged
in six groups of two each. Hence the probability of the event is
$12!/(2^6 \cdot 6^{12}) = 0.003438\ldots$.
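
The multinomial coefficient (4.7) can be computed by the successive selection of groups used in the proof of Theorem 3. The sketch below is an illustrative Python version; the function name `multinomial` and the variable names are assumptions, not notation from the text.

```python
import math

def multinomial(*sizes):
    """n!/(r_1! r_2! ... r_k!) with n = sum of the group sizes, as in (4.7)."""
    count, remaining = 1, sum(sizes)
    for r in sizes:
        count *= math.comb(remaining, r)   # choose the next group, as in the proof
        remaining -= r
    return count

# (d) Bridge: number of deals, and probability that each player holds an ace.
deals = multinomial(13, 13, 13, 13)
p_each_has_ace = math.factorial(4) * multinomial(12, 12, 12, 12) / deals

# (e) Dice: each face appears twice in a throw of 12 dice.
p_each_face_twice = multinomial(2, 2, 2, 2, 2, 2) / 6 ** 12

print(f"{deals:.4e}", p_each_has_ace, p_each_face_twice)
```

Multiplying binomial coefficients keeps every intermediate value an exact integer, which avoids forming the very large factorial quotient directly.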


5. The Hypergeometric Distribution 

Many combinatorial problems can be reduced to the following form.
In a population of n elements $n_1$ are red and $n_2 = n - n_1$ are black.
A group of r elements is chosen at random (without replacement and
without regard to order). We seek the probability $q_k$ that the group
so chosen will contain exactly k red elements. Here k can be any
integer between zero and $n_1$ or r, whichever is smaller.

To find $q_k$, we note that the chosen group contains k red and $r - k$
black elements. The red ones can be chosen in $\binom{n_1}{k}$ different ways
and the black ones in $\binom{n - n_1}{r - k}$ ways. Since any choice of k red
elements may be combined with any choice of black ones, we find

(5.1) $q_k = \binom{n_1}{k}\binom{n - n_1}{r - k} \Big/ \binom{n}{r}$.

The system of probabilities so defined is called the hypergeometric
distribution. 6 Using formula (4.3), it is possible to rewrite (5.1) in the form

(5.2) $q_k = \binom{r}{k}\binom{n - r}{n_1 - k} \Big/ \binom{n}{n_1}$.

Note. The probabilities $q_k$ are defined only for k not exceeding r or
$n_1$. However, from the definition (4.2) it follows that $\binom{a}{b} = 0$ whenever
$b > a$. Therefore, formulas (5.1) and (5.2) give $q_k = 0$ if either
$k > n_1$ or $k > r$. Accordingly, the definitions (5.1) and (5.2) may be
used for all $k \ge 0$, provided that the relation $q_k = 0$ is interpreted as
impossibility.
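
The two forms (5.1) and (5.2), together with the convention of the Note, can be written out as follows. This is a minimal sketch in Python with exact fractions; the function names are hypothetical, and the numbers are those of the senators in example (4.c).

```python
import math
from fractions import Fraction

def hypergeom(k, n, n1, r):
    """q_k of (5.1): k red among r drawn from n elements, n1 of them red."""
    if k < 0 or r - k < 0:
        return Fraction(0)
    # math.comb(a, b) is 0 when b > a, matching the convention in the Note.
    return Fraction(math.comb(n1, k) * math.comb(n - n1, r - k), math.comb(n, r))

def hypergeom_alt(k, n, n1, r):
    """The equivalent form (5.2)."""
    if k < 0 or n1 - k < 0:
        return Fraction(0)
    return Fraction(math.comb(r, k) * math.comb(n - r, n1 - k), math.comb(n, n1))

# Senators of example (4.c): n = 96, n1 = 2 from the given state, r = 48 chosen.
q = [hypergeom(k, 96, 2, 48) for k in range(4)]
print([float(x) for x in q])
```

The value for k = 3 comes out as zero, in agreement with the interpretation of $q_k = 0$ as impossibility.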


Examples. (a) Quality Inspection. In industrial quality control,
lots of size n are subjected to sampling inspection. The defective items
in the lot play the role of “red” elements. Their number $n_1$ is, of
course, unknown. A sample of size r is taken, and the number k of
defective items in it is determined. Formula (5.1) then permits us
to draw inferences as to the likely magnitude of $n_1$; this is a typical
problem of statistical estimation which is beyond the scope of the
present book.

(b) In example (4.c), the population consists of n = 96 senators of
whom $n_1 = 2$ represent the given state (are “red”). A group of
r = 48 senators is chosen at random. It may include k = 0, 1, or 2
senators from the given state. From (5.2) we find [remembering that
$\binom{48}{0} = 1$ by (4.5)]

$q_0 = \frac{48 \cdot 47}{96 \cdot 95} = 0.24737\ldots$,

$q_1 = \frac{48}{95} = 0.50526\ldots$,

$q_2 = \frac{48 \cdot 47}{96 \cdot 95} = 0.24737\ldots$.

The value $q_0$ was obtained in a different way in example (4.c).

6 The name is explained by the fact that the generating function (cf. Chapter 11)
of $\{q_k\}$ can be expressed in terms of hypergeometric functions.


(c) Distribution of Aces among r Bridge Cards. Let $p_0(r), p_1(r),
\ldots, p_4(r)$ denote the probabilities that among r bridge cards drawn at
random there are 0, ..., 4 aces, respectively. For r = 5 we obtain
the probability that a poker hand (cf. footnote on p. 9) contains
0, 1, ..., 4 aces. We get $p_k(r)$ from (5.2) with n = 52, $n_1 = 4$. A
simple calculation shows that

(5.3)
$p_0(r) = \frac{(52 - r)_4}{(52)_4}$,

$p_1(r) = \frac{4r\,(52 - r)_3}{(52)_4}$,

$p_2(r) = \frac{6r(r-1)\,(52 - r)_2}{(52)_4}$,

$p_3(r) = \frac{4r(r-1)(r-2)\,(52 - r)}{(52)_4}$,

$p_4(r) = \frac{(r)_4}{(52)_4}$.

Table 3 gives the values $p_k(r)$ for all possible combinations of k and r.
In using it, note that the probability of having 0, 1, ..., 4 aces among
r cards is the same as the probability of having 4, 3, ..., 0 aces among
the remaining 52 - r cards. In other words, we have

(5.4) $p_0(r) = p_4(52 - r)$, $p_1(r) = p_3(52 - r)$, $p_2(r) = p_2(52 - r)$

as can be verified from (5.3).
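
The probabilities (5.3) and the symmetry relation (5.4) can be checked numerically from the hypergeometric form (5.2). The following Python sketch is an illustration only; the function name `p_aces` and its default arguments are assumptions introduced for the example.

```python
import math

def p_aces(k, r, deck=52, aces=4):
    """Probability of exactly k aces among r cards, from (5.2) with n = 52, n1 = 4."""
    if not 0 <= k <= min(aces, r) or aces - k > deck - r:
        return 0.0
    return math.comb(r, k) * math.comb(deck - r, aces - k) / math.comb(deck, aces)

# A bridge hand (r = 13).
row_13 = [round(p_aces(k, 13), 5) for k in range(5)]
print(row_13)

# Symmetry relation (5.4), checked for every r and k.
symmetric = all(
    math.isclose(p_aces(k, r), p_aces(4 - k, 52 - r))
    for r in range(1, 52) for k in range(5)
)
print(symmetric)
```

Evaluating `p_aces` for r = 1, ..., 26 gives one way to recompute the entries of a table such as Table 3.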

(d) A Waiting Time Problem. We shall for a moment deviate from
our path in order to discuss an alternative interpretation of the probabilities
of Table 3. Suppose that card after card is drawn from a
deck of bridge cards and that all 52! possible orders of picking have
equal probabilities. Then $p_k(r)$ is the probability that among the first
r cards there will be exactly k aces. In particular, $p_0(r)$ is the probability
that no ace turns up in the first r drawings or that more than r
are required for the first ace to turn up. In a similar way $p_0(r) + p_1(r)$
is the probability that in r drawings 0 or 1 ace turns up, which means
that it takes more than r drawings before the second ace turns up.
Also, $p_0(r) + p_1(r) + p_2(r)$ becomes the probability that the third
ace turns up sometime after the rth drawing.

For each $k \le 4$ there exists a number $r_k$ of drawings such that the
probabilities that the kth ace will turn up before or after the $r_k$th
drawing will both be closest to 1/2. This number $r_k$ is called the
median for the kth ace. From Table 3 we find that the four medians
are 8, 20, 32, 44, respectively.

TABLE 3
Probabilities (5.3)

  r     p0(r)     p1(r)     p2(r)     p3(r)     p4(r)
  1    0.92308   0.07692   0.00000   0.00000   0.00000    51
  2     .85068    .14480    .00452    .00000    .00000    50
  3     .78262    .20416    .01303    .00018    .00000    49
  4     .71874    .25555    .02500    .00071    .00000    48
  5     .65884    .29947    .03993    .00174    .00002    47
  6     .60277    .33643    .05735    .00340    .00006    46
  7     .55036    .36690    .07679    .00582    .00013    45
  8     .50144    .39136    .09784    .00910    .00026    44
  9     .45585    .41027    .12008    .01334    .00047    43
 10     .41344    .42405    .14312    .01862    .00078    42
 11     .37407    .43313    .16659    .02499    .00122    41
 12     .33757    .43794    .19016    .03250    .00183    40
 13     .30382    .43885    .21349    .04120    .00264    39
 14     .27266    .43625    .23630    .05109    .00370    38
 15     .24396    .43051    .25831    .06218    .00504    37
 16     .21758    .42198    .27925    .07447    .00672    36
 17     .19341    .41099    .29890    .08791    .00879    35
 18     .17130    .39786    .31705    .10248    .01130    34
 19     .15115    .38291    .33350    .11812    .01432    33
 20     .13283    .36642    .34810    .13475    .01790    32
 21     .11622    .34867    .36070    .15229    .02211    31
 22     .10123    .32993    .37117    .17065    .02702    30
 23     .08773    .31043    .37942    .18971    .03271    29
 24     .07563    .29042    .38537    .20933    .03925    28
 25     .06483    .27011    .38896    .22938    .04673    27
 26     .05522    .24970    .39016    .24970    .05522    26
        p4(r)     p3(r)     p2(r)     p1(r)     p0(r)      r

$p_k(r)$ is the probability that among r bridge cards selected at random there will
be exactly k aces.

Define now a new game by the following rule: “Cards are drawn as 
long as necessary for the first ace to turn up.” Note that this rule 
defines a new sample space. We get a picture of the sample points if 
in each of the 52! different orderings of the deck of cards we remove 
all cards which come after the first ace. Naturally, several orderings
will lead to the same new sample point, and we have here the first
example of a sample space in which we attribute different probabilities 
to the various points. 

(e) A Sampling Problem. Suppose that 1000 fish caught in a lake 
are marked by red spots and released. After a while a new catch of 
1000 fish is made, and it is found that 100 among them have red spots. 
What conclusions can be drawn concerning the number of fish in the 
lake? This is a typical problem of statistical estimation. It would lead 
us too far to describe the various methods that a modern statistician 
might use, but we shall show how the hypergeometric distribution gives 
us a clue to the solution of the problem. We assume naturally that the 
two catches may be considered as random samples from the population 
of all fish in the lake. (In practice this assumption excludes situations 
where the two catches are made at one locality and within a short 
time.) We also suppose that the number of fish in the lake does not 
change between the two catches. 

We generalize the problem by admitting arbitrary sample sizes. Let 

$n$ = the (unknown) number of fish in the lake.

$n_1$ = the number of fish in the first catch. They play the role of
red balls.

$r$ = the number of fish in the second catch.

$k$ = the number of red fish in the second catch.

$q_k(n)$ = the probability that the second catch contains exactly $k$ red
fish.

In this formulation it is rather obvious that $q_k(n)$ is given by (5.1).
In practice $n_1$, $r$, and $k$ can be observed, but $n$ is unknown. Note,
incidentally, that $n$ is a fixed number which in no way depends on
chance. It is, therefore, meaningless to ask for the probability that $n$
is greater than, say, 6000. We know that $n_1 + r - k$ different fish
were caught, and therefore $n \ge n_1 + r - k$. This is all that can be
said with certainty. In our example we had $n_1 = r = 1000$ and
$k = 100$, and it is conceivable that the lake contains only 1900 fish.
However, starting from this hypothesis, we are led to the conclusion
that an event of a fantastically small probability has occurred. In
fact, the probability that two samples of size 1000 will exhaust an
entire population of 1900 fish is, by (5.1),

$$\frac{\binom{1000}{100}\binom{900}{900}}{\binom{1900}{1000}} = \frac{(1000!)^2}{100!\,1900!}.$$





Stirling's formula (cf. section 7) shows this probability to be of the
order of magnitude $10^{-430}$, and we therefore reject our hypothesis as
unreasonable. A similar reasoning would induce us to reject the
hypothesis that $n$ is very large, say, a million. This consideration leads
us to seek the particular value of $n$ for which $q_k(n)$ attains its largest
value, since for that $n$ our observation would have the greatest
probability. For any particular set of observations $n_1$, $r$, $k$ the value of $n$
for which $q_k(n)$ is largest is denoted by $\hat{n}$ and is called the maximum
likelihood estimate of $n$. This notion was introduced by R. A. Fisher.
To find $\hat{n}$ consider the ratio

$$\frac{q_k(n)}{q_k(n-1)} = \frac{(n - n_1)(n - r)}{(n - n_1 - r + k)\,n}. \tag{5.5}$$

A simple calculation shows that this ratio is greater than or smaller
than unity, according as $nk < n_1 r$ or $nk > n_1 r$. This means that with
increasing $n$ the sequence $q_k(n)$ first increases and then decreases; it
reaches its maximum when $n$ is the largest integer not exceeding $n_1 r/k$, so
that

$$\hat{n} \approx \frac{n_1 r}{k}. \tag{5.6}$$



In our particular example the maximum likelihood estimate of the
number of fish is $\hat{n} = 10{,}000$.
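
Both numerical claims of this example can be checked mechanically. The sketch below is illustrative only: the function names are ours, the order of magnitude is obtained from the log-gamma function rather than from Stirling's formula, and the maximum is located by stepping through the ratio (5.5) as long as it is at least 1 (this assumes $k \ge 1$).

```python
from fractions import Fraction
from math import lgamma, log

def log_binom(a, b):
    """Natural logarithm of the binomial coefficient C(a, b)."""
    return lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1)

def log10_exhaust(n1=1000, r=1000, k=100):
    """log10 of (5.1) for the smallest possible lake, n = n1 + r - k."""
    n = n1 + r - k
    value = log_binom(n1, k) + log_binom(n - n1, r - k) - log_binom(n, r)
    return value / log(10)

def ratio(n, n1, r, k):
    """The ratio (5.5), q_k(n) / q_k(n - 1), as an exact fraction."""
    return Fraction((n - n1) * (n - r), (n - n1 - r + k) * n)

def mle(n1, r, k):
    """Largest n for which q_k(n) >= q_k(n - 1); requires k >= 1."""
    n = n1 + r - k                      # smallest n consistent with the data
    while ratio(n + 1, n1, r, k) >= 1:
        n += 1
    return n

print(log10_exhaust())        # exponent of the "fantastically small" probability
print(mle(1000, 1000, 100))   # maximum likelihood estimate for the example
```

In the example the ratio (5.5) equals exactly 1 at $n = 10{,}000$, so that $q_k(9999) = q_k(10{,}000)$ and both values of $n$ maximize the probability.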

The true number $n$ may be larger or smaller, and we may ask for
limits within which we may reasonably expect $n$ to lie. For this
purpose let us test the hypothesis that $n$ is smaller than 8500. We
substitute in (5.1) $n = 8500$, $n_1 = r = 1000$, and calculate the probability
that the second sample contains 100 or fewer red fish. This
probability is $x = q_0 + q_1 + \cdots + q_{100}$. A direct evaluation is
cumbersome, but using the normal approximation of Chapter 7, we find easily
that $x \approx 0.04$. Similarly, if $n = 12{,}000$, the probability that the second
sample contains 100 or more red fish is about 0.03. These figures
would justify a bet that the true number $n$ of fish lies somewhere
between 8500 and 12,000. There exist other ways of formulating these
conclusions and other methods of estimation, but we do not propose to
discuss the details.
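
With exact integer arithmetic the "cumbersome" direct evaluation is no longer an obstacle. The following sketch sums the terms of (5.1) exactly; it is offered only as a cross-check of the two normal-approximation figures quoted above, and the function name `tail_le` is an assumption of this illustration.

```python
from fractions import Fraction
from math import comb

def tail_le(n, n1, r, kmax):
    """Exact probability, under (5.1), of at most kmax red fish in the second catch."""
    numerator = sum(comb(n1, k) * comb(n - n1, r - k) for k in range(kmax + 1))
    return float(Fraction(numerator, comb(n, r)))

x = tail_le(8500, 1000, 1000, 100)          # 100 or fewer red fish, n = 8500
y = 1.0 - tail_le(12000, 1000, 1000, 99)    # 100 or more red fish, n = 12,000
print(x, y)
```

The exact values need not agree with the rounded figures 0.04 and 0.03 in every digit, since the latter come from an approximation.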


From the definition of the probabilities $q_k$ it follows that $q_0 + q_1
+ q_2 + \cdots = 1$. Formula (5.2) therefore implies that for any positive
integers $n$, $n_1$, and $r$

$$\binom{r}{0}\binom{n-r}{n_1} + \binom{r}{1}\binom{n-r}{n_1-1} + \cdots + \binom{r}{n_1}\binom{n-r}{0} = \binom{n}{n_1}. \tag{5.7}$$

This identity is frequently useful. We have proved it only for positive
integers $n$ and $r$, but it holds true without this restriction for arbitrary
positive or negative numbers $n$ and $r$ (it is meaningless if $n_1$ is not a
positive integer). (An indication of two proofs is given in section 9,
problems 5 and 6; for application cf. problems 7-13.)
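
The claim that the identity survives for arbitrary $n$ and $r$ can be tested numerically. The sketch below assumes the identity in the form $\sum_\nu \binom{r}{\nu}\binom{n-r}{n_1-\nu} = \binom{n}{n_1}$ and uses the product definition $x(x-1)\cdots(x-j+1)/j!$ of the binomial coefficient for non-integral $x$ (the extension discussed in section 6); the helper names `binom` and `lhs` are ours.

```python
from fractions import Fraction
from math import factorial

def binom(x, j):
    """x(x-1)...(x-j+1)/j! for rational x and integer j; zero when j < 0."""
    if j < 0:
        return Fraction(0)
    prod = Fraction(1)
    for i in range(j):
        prod *= Fraction(x) - i
    return prod / factorial(j)

def lhs(n, r, n1):
    """Sum over nu of C(r, nu) * C(n - r, n1 - nu), for nu = 0, ..., n1."""
    rest = Fraction(n) - Fraction(r)
    return sum(binom(r, nu) * binom(rest, n1 - nu) for nu in range(n1 + 1))

# Positive integers first, then a negative fractional n and a fractional r.
cases = [(52, 13, 4), (10, 3, 5), (Fraction(-7, 2), Fraction(5, 3), 4)]
for n, r, n1 in cases:
    print(n, r, n1, lhs(n, r, n1) == binom(n, n1))
```

Exact rational arithmetic is used so that equality can be tested without rounding error.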

The hypergeometric distribution can easily be generalized to the
case where the original population of size $n$ contains several classes of
elements. For example, let the population contain three classes of
sizes $n_1$, $n_2$, and $n - n_1 - n_2$, respectively. If a sample of size $r$ is
taken, the probability that it contains $k_1$ elements of the first, $k_2$
elements of the second, and $r - k_1 - k_2$ elements of the last class is,
by analogy with (5.1),

$$\frac{\binom{n_1}{k_1}\binom{n_2}{k_2}\binom{n-n_1-n_2}{r-k_1-k_2}}{\binom{n}{r}}.$$

It is, of course, necessary that $k_1 \le n_1$, $k_2 \le n_2$, and $r - k_1 - k_2
\le n - n_1 - n_2$. [For further properties of the hypergeometric
distribution cf. section 8, problem 42; Chapter 6, example (4.g); Chapter 7,
problem 10; and Chapter 9, example (5.e) and problem 11.]

Example. (f) Bridge. The population of 52 cards consists of four
classes, each of thirteen elements. The probability that a hand of 13
cards consists of five spades, four hearts, three diamonds, and one club
is

$$\frac{\binom{13}{5}\binom{13}{4}\binom{13}{3}\binom{13}{1}}{\binom{52}{13}}.$$

The probability that a hand contains 5 cards of some suit, 4 of another,
3 of a third, and 1 of the last suit is 4! times as large, or 0.1293.
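
The arithmetic of this example is easily reproduced. The sketch below implements the several-class generalization of (5.1) for any number of classes; the function name `multi_hypergeom` and its argument names are ours.

```python
from math import comb, factorial

def multi_hypergeom(class_sizes, counts):
    """Probability that a random sample contains counts[i] elements of class i."""
    n, r = sum(class_sizes), sum(counts)
    numerator = 1
    for size, c in zip(class_sizes, counts):
        numerator *= comb(size, c)
    return numerator / comb(n, r)

# Five spades, four hearts, three diamonds, one club.
p_fixed = multi_hypergeom([13, 13, 13, 13], [5, 4, 3, 1])
# Any assignment of the pattern 5-4-3-1 to the four suits.
p_any = factorial(4) * p_fixed
print(p_fixed, p_any)
```

The factor 4! counts the ways of assigning the four distinct lengths 5, 4, 3, 1 to the four suits; for a pattern with repeated lengths the factor would be smaller.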




6. Binomial Coefficients 

The quantities $(n)_r$ and $\binom{n}{r}$, defined in (2.1) and (4.2), have been
defined only if $n$ and $r$ are positive integers, but it is convenient to
extend their definition. Since $r$ denotes the number of factors, it
must be an integer. However, the number

$$(x)_r = x(x-1)\cdots(x-r+1) \tag{6.1}$$

is well defined for all real $x$ provided only that $r$ is a positive integer.
For $r = 0$ we shall put $(x)_0 = 1$. Then

$$\binom{x}{r} = \frac{(x)_r}{r!} = \frac{x(x-1)\cdots(x-r+1)}{r!} \tag{6.2}$$

defines the binomial coefficients for all values of $x$ and all positive integers $r$.
For $r = 0$ we put, as in (4.5), $\binom{x}{0} = 1$ and $0! = 1$. For negative integers
$r$ we put

$$\binom{x}{r} = 0 \qquad (r < 0). \tag{6.3}$$

We shall never use the symbol $\binom{x}{r}$ if $r$ is not an integer.
With this definition we have, for example, 

(-;)->• o- a 2 )-' 


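$$\binom{3/2}{4} = \frac{3}{2}\cdot\frac{1}{2}\cdot\frac{-1}{2}\cdot\frac{-3}{2}\cdot\frac{1}{4!} = \frac{3}{128}, \qquad
\binom{-1/3}{3} = \frac{-1}{3}\cdot\frac{-4}{3}\cdot\frac{-7}{3}\cdot\frac{1}{6} = -\frac{14}{81}.$$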


Three important properties will be used in the sequel. First, for
any positive integer $n$

$$\binom{n}{r} = 0 \quad \text{if either } r > n \text{ or } r < 0. \tag{6.4}$$

Second, for any number $x$ and any integer $r$

$$\binom{x}{r-1} + \binom{x}{r} = \binom{x+1}{r}. \tag{6.5}$$





These relations are easily verified from the definition. The third
relation will be assumed known from calculus textbooks: for any number
$a$ and all values $-1 < t < 1$ we have Newton's binomial formula

$$(1+t)^a = 1 + \binom{a}{1}t + \binom{a}{2}t^2 + \binom{a}{3}t^3 + \cdots. \tag{6.6}$$

If $a$ is a positive integer, all terms to the right containing powers higher
than $t^a$ vanish automatically; then the formula is correct for all $t$. If $a$
is not a positive integer, the right side represents an infinite series.
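
The extended definition is easy to put into code, which also makes the first two properties checkable. The following is a minimal sketch using exact rational arithmetic; the function name `binom` is ours, and the conventions for $r = 0$ and $r < 0$ are the ones stated in the text.

```python
from fractions import Fraction
from math import factorial

def binom(x, r):
    """Binomial coefficient x(x-1)...(x-r+1)/r! for rational x and integer r."""
    if r < 0:
        return Fraction(0)            # convention for negative r
    prod = Fraction(1)                # empty product, so binom(x, 0) = 1
    for i in range(r):
        prod *= Fraction(x) - i
    return prod / factorial(r)

print(binom(Fraction(3, 2), 4))       # 3/128
print(binom(Fraction(-1, 3), 3))      # -14/81
print(binom(5, 7), binom(5, -1))      # both vanish: r > n or r < 0
x = Fraction(-5, 2)
print(binom(x, 2) + binom(x, 3) == binom(x + 1, 3))   # addition rule for r = 3
```

For a positive integer $n$ and $r > n$ the product contains the factor $n - n = 0$, which is why the coefficient vanishes automatically.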

As an example consider the case $a = -1$. We find easily

$$\binom{-1}{r} = (-1)^r, \tag{6.7}$$

whence (6.6) reduces to the geometric series

$$\frac{1}{1+t} = 1 - t + t^2 - t^3 + t^4 - + \cdots, \tag{6.8}$$

an expansion which is valid for $-1 < t < 1$. Integrating (6.8), we
obtain another formula which will be useful in the sequel, namely, the
Taylor expansion of the natural logarithm

$$\log(1+t) = t - \tfrac{1}{2}t^2 + \tfrac{1}{3}t^3 - \tfrac{1}{4}t^4 + \cdots. \tag{6.9}$$

This expansion is again valid for $-1 < t < 1$. It has been used in
(3.5).
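
The expansions can be watched converging numerically. The sketch below sums finitely many terms of Newton's series and of the logarithmic series in floating point and compares them with the library functions; the truncation lengths and the value $t = 0.3$ are arbitrary choices of this illustration.

```python
import math

def binom(a, r):
    """a(a-1)...(a-r+1)/r! for real a and integer r >= 0, in floating point."""
    out = 1.0
    for i in range(r):
        out *= (a - i) / (i + 1)
    return out

def newton_series(a, t, terms=60):
    """Partial sum of the binomial series for (1 + t)^a."""
    return sum(binom(a, r) * t**r for r in range(terms))

def log_series(t, terms=200):
    """Partial sum of t - t^2/2 + t^3/3 - ... for log(1 + t)."""
    return sum((-1) ** (j + 1) * t**j / j for j in range(1, terms + 1))

t = 0.3
print(newton_series(0.5, t), math.sqrt(1 + t))   # a = 1/2
print(newton_series(-1, t), 1 / (1 + t))         # a = -1, the geometric series
print(log_series(t), math.log(1 + t))
```

Convergence slows down as $|t|$ approaches 1, so more terms are needed near the ends of the interval.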

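Many other relations can be deduced from (6.6). For example, for
any positive integer $n$ we find, letting $t = 1$,

$$\binom{n}{0} + \binom{n}{1} + \binom{n}{2} + \cdots + \binom{n}{n} = 2^n, \tag{6.10}$$

and letting $t = -1$,

$$\binom{n}{0} - \binom{n}{1} + \binom{n}{2} - + \cdots + (-1)^n\binom{n}{n} = 0. \tag{6.11}$$

These two identities will be used in the sequel. (For further identities
see the problems of section 9.)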


7. Stirling's Formula

A direct numerical evaluation of expressions involving $n!$ is usually
inexpedient, and it is therefore important to find simple approximations
to $n!$. It is clear that $n^n$ increases faster than $n!$, and we shall obtain
information concerning $n!$ from a study of the ratio $a_n = n!/n^n$.
Obviously

$$\frac{a_{n+1}}{a_n} = \frac{1}{\left(1 + \frac{1}{n}\right)^n}. \tag{7.1}$$

Now the basis of the natural logarithms $e = 2.71828\ldots$ is defined as
the limit of the denominator in (7.1), so that $a_{n+1}/a_n \to 1/e$. Thus
for large $n$ the sequence $a_n$ behaves essentially like a geometric sequence
with ratio $e^{-1}$. This suggests replacing $n^n$ by $(n/e)^n$ and studying the
sequence

$$b_n = \frac{n!}{(n/e)^n} = a_n e^n. \tag{7.2}$$


To investigate the growth of $b_n$, we first pass to logarithms, then use
(7.1) and the Taylor expansion (6.9) to find

$$\log\frac{b_{n+1}}{b_n} = 1 + \log\frac{a_{n+1}}{a_n} = 1 - n\log\left(1 + \frac{1}{n}\right) = \frac{1}{2n} - \frac{1}{3n^2} + \cdots. \tag{7.3}$$

The right-hand member is positive, so that $b_{n+1}/b_n$ is greater than 1.
Hence the sequence $\{b_n\}$ increases, which means that $n!$ grows faster
than $(n/e)^n$. On the other hand, if we replace the exponent by $n + 1/2$
and put

$$\beta_n = \frac{n!}{n^{n+1/2}e^{-n}}, \tag{7.4}$$

then (7.3) is to be replaced by

$$\log\frac{\beta_{n+1}}{\beta_n} = 1 - \left(n + \frac{1}{2}\right)\log\left(1 + \frac{1}{n}\right) = -\frac{1}{12n^2} + \frac{1}{12n^3} - \cdots. \tag{7.5}$$

The right side is negative for all $n$, and hence $\beta_{n+1} < \beta_n$. We see,
therefore, that the sequence $\beta_n$ decreases so that a limit $\beta = \lim \beta_n$
exists. From (7.4) we have then

$$\frac{n!}{n^{n+1/2}e^{-n}} \to \beta. \tag{7.6}$$





For a satisfactory approximation to $n!$ we now require only the
numerical value of $\beta$. In principle $\beta$ could be calculated to any desired number
of decimals by computing the left side in (7.6) for some $n$ sufficiently
large. Fortunately we can avoid this uninspiring procedure, since the
theory developed in Chapter 7 will show that $\beta = (2\pi)^{1/2}$. Up to
then the exact value of $\beta$ will play no role, and we are free to postpone
the proof that $\beta = (2\pi)^{1/2}$ (cf. Chapter 7, section 2). We can now
rewrite (7.6) in the final form known as

Stirling's Formula

$$n! \sim (2\pi)^{1/2}\, n^{n+1/2} e^{-n}. \tag{7.7}$$

Here the sign $\sim$ is used to indicate that the ratio of the two sides tends to 1.

It is true that the difference of the two sides in (7.7) increases over
all bounds, but it is the percentage error which really matters. The
percentage error decreases steadily (since $\beta_n$ decreases), and Table 4
shows that Stirling's approximation is remarkably accurate even for
small $n$. For $n = 5$ the error is 2 per cent, and for $n > 9$ it has dropped
below 1 per cent.


TABLE 4
Stirling's Approximations to n!

                           Approximation        Per Cent   Approximation        Per Cent
   n   n!                  from (7.7)           Error      from (7.9)           Error
   1   1                   0.922                8          1.0023               0.2
   2   2                   1.919                4          2.0007                .04
   5   120                 118.019              2          120.01                .01
  10   (3.62880)10^6       (3.5986...)10^6      0.8        (3.62881...)10^6
 100   (9.3326...)10^157   (9.3248...)10^157    0.08       (9.3326...)10^157
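
The columns of Table 4 can be regenerated as follows. This is a sketch in ordinary floating point, assuming that (7.9) is formula (7.7) multiplied by $e^{1/(12n)}$; the function names are ours, and the printed digits may differ from the table in the last place because of rounding conventions.

```python
from math import exp, factorial, pi, sqrt

def stirling(n):
    """First approximation: (2*pi)^(1/2) * n^(n + 1/2) * e^(-n)."""
    return sqrt(2 * pi) * n ** (n + 0.5) * exp(-n)

def stirling2(n):
    """Second-term approximation: the first one times e^(1/(12n))."""
    return stirling(n) * exp(1 / (12 * n))

for n in (1, 2, 5, 10, 100):
    exact = factorial(n)
    err1 = 100 * (exact - stirling(n)) / exact     # per cent, underestimate
    err2 = 100 * (stirling2(n) - exact) / exact    # per cent, overestimate
    print(n, f"{stirling(n):.6g}", f"{err1:.2g}", f"{stirling2(n):.6g}", f"{err2:.2g}")
```

Both error columns are computed so as to be positive, in accordance with the statements that the first approximation underestimates and the second overestimates $n!$.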



We saw that $\beta_n$ decreases so that $\beta_n > \beta$. This implies that Stirling's
approximation (7.7) somewhat underestimates $n!$. It is easy to improve on (7.7) and,
in particular, to find alternative approximations which will overestimate $n!$. For
that purpose it suffices to modify the definition (7.4) of $\beta_n$ so that the right
side of (7.5) will be positive. Now, if we put $\gamma_n = \beta_n e^{-1/(12n)}$, then
$\gamma_n \to \beta$ and

$$\log\frac{\gamma_{n+1}}{\gamma_n} = \log\frac{\beta_{n+1}}{\beta_n} + \frac{1}{12n(n+1)}. \tag{7.8}$$

A comparison with (7.5) shows that the right side in (7.8) starts with terms of the
magnitude $n^{-4}$, and we can, therefore, expect that

$$n! \sim (2\pi)^{1/2}\, n^{n+1/2} e^{-n+1/(12n)} \tag{7.9}$$

will give an even better approximation to $n!$ than (7.7). The last two columns of
Table 4 confirm this assumption. Even for $n = 2$ formula (7.9) yields an error of
less than 4 hundredths of a per cent. An additional feature of (7.9) is that it always
overestimates $n!$. To prove this we have only to show that $\gamma_n$ increases or that (7.8)
is positive. Note that (7.8) is symmetric in $n$ and $n + 1$, so that it is best to express
the formula in terms of the mean $\mu = n + \tfrac{1}{2}$. Now $n = \mu - \tfrac{1}{2}$ and
$n + 1 = \mu + \tfrac{1}{2}$, and, using (7.5), we can rewrite (7.8) in the form

$$\log\frac{\gamma_{n+1}}{\gamma_n} = 1 - \mu\left\{\log\left(1 + \frac{1}{2\mu}\right) - \log\left(1 - \frac{1}{2\mu}\right)\right\} + \frac{1}{12\mu^2}\cdot\frac{1}{1 - 1/(4\mu^2)}. \tag{7.10}$$

Formulas (6.8) and (6.9) provide convenient expansions for the last fraction and
the two logarithms. We get

$$\log\frac{\gamma_{n+1}}{\gamma_n} = 1 - \mu\left\{\frac{1}{\mu} + \frac{1}{12\mu^3} + \frac{1}{80\mu^5} + \frac{1}{448\mu^7} + \cdots\right\} + \frac{1}{12\mu^2}\left\{1 + \frac{1}{4\mu^2} + \frac{1}{16\mu^4} + \cdots\right\} = \frac{1}{120\mu^4} + \frac{1}{336\mu^6} + \cdots. \tag{7.11}$$

The coefficient of $\mu^{-2k}$ is $2^{-2k}\left(\frac{1}{3} - \frac{1}{2k+1}\right) > 0$, and hence all terms are positive.
This accomplishes the proof of the theorem: the right side in (7.9) overestimates
$n!$, while the right side in (7.7) underestimates $n!$.

Formula (7.9) is called Stirling's second-term approximation to $n!$. In the same
way we see that a third approximation would add a term $e^{-1/(360n^3)}$, etc.
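
The two-sided estimate and the effect of the third term can be checked on the logarithmic scale, where no large numbers occur. The sketch below takes $\log n!$ from the log-gamma function and assumes that the third approximation multiplies the second by $e^{-1/(360n^3)}$; the function name and the `terms` parameter are ours.

```python
from math import lgamma, log, pi

def log_stirling(n, terms=1):
    """Logarithm of Stirling's approximation with 1, 2, or 3 terms in the exponent."""
    value = 0.5 * log(2 * pi) + (n + 0.5) * log(n) - n
    if terms >= 2:
        value += 1 / (12 * n)          # second-term approximation
    if terms >= 3:
        value -= 1 / (360 * n**3)      # third approximation
    return value

for n in (1, 2, 5, 10, 50):
    exact = lgamma(n + 1)              # log n!
    print(n,
          exact - log_stirling(n, 1),         # positive: underestimate
          log_stirling(n, 2) - exact,         # positive: overestimate
          abs(exact - log_stirling(n, 3)))    # smaller still
```

The differences shrink roughly like $1/(12n)$, $1/(360n^3)$, and a higher power of $1/n$, respectively.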


8. Problems for Solution: Combinatorial 

Note: Assume in each case that all arrangements have the same probability. 

1. How many different sets of initials can be formed if every person has (a)
exactly two, (b) at most two, (c) at most three, given names?

2. In how many ways can two rooks of different colors be put on a chessboard 
so that they can take each other? 

3. Letters in the Morse alphabet are formed by a succession of dashes and dots 
with repetitions permitted. How many letters is it possible to form with ten 
symbols or less? 

4. Each domino piece is marked by two numbers. The pieces are symmetrical
so that the number-pair is not ordered. How many different pieces can be made
using the numbers $1, 2, \ldots, n$?

5. The numbers $1, 2, \ldots, n$ are arranged in random order. Find the probability
that the digits (a) 1 and 2, (b) 1, 2, and 3, appear as neighbors in the order named.

6. Find the probability that among three random digits there occur 2, 1, or 0 
repetitions. 

7. Do problem 6 for the case of four random digits. 

8. Find the probabilities $p_r$ that in a sample of $r$ random digits no two are equal.
Estimate the numerical value of $p_{10}$, using Stirling's formula.

9. What is the probability that among $k$ random digits (a) 0 does not appear;
(b) 1 does not appear; (c) neither 0 nor 1 appears; (d) at least one of the two digits
0 and 1 does not appear? Let $A$ and $B$ represent the events in (a) and (b). Express
the other events in terms of $A$ and $B$.

10. What is the probability that among $k$ random digits 0 appears (a) exactly
three times; (b) three times or less?

11. Suppose that each of $n$ sticks is broken into one long and one short part.
The $2n$ parts are arranged into $n$ pairs from which new sticks are formed. Find
the probability (a) that the parts will be joined in the original order, (b) that all
long parts are paired with short parts.⁷

12. Testing a statistical hypothesis. A Cornell professor got a ticket twelve times 
for illegal overnight parking. All twelve tickets were given either Tuesdays or 
Thursdays. Find the probability of this event. (Was his renting a garage only 
for Tuesdays and Thursdays justified?) 

13. Continuation. Of twelve police tickets none was given on Sunday. Is this 
evidence that no tickets are given on Sundays? 

14. From a population of 100 people a sample of size $n$ is taken. Find the
probability that none of $r$ given people will be included in the sample, assuming the
sampling to be (a) without, (b) with replacement. Compare the numerical values
for $n = r = 3$ and $n = r = 10$.

15. A box contains 90 good and 10 defective screws. If 10 screws are used, what 
is the probability that none is defective? 

16. From the population of five symbols a, b, c, d, e, a sample of size 25 is taken.
Find the probability that the sample will contain five symbols of each kind. Check
the result in tables of random numbers,⁸ identifying the digits 0 and 1 with a,
the digits 2 and 3 with b, etc.

17. If $n$ men, among whom are A and B, stand in a row, what is the probability
that there will be exactly $r$ men between A and B? If they stand in a ring instead
of in a row, show that the probability is independent of $r$ and hence $1/(n - 1)$.

18. What is the probability that 2 throws with 3 dice each will show the same
configuration if (a) the dice are distinguishable, (b) they are not?

19. Show that it is more probable to get at least one ace with 4 dice than at least
one double ace in 24 throws of 2 dice. (The answer is known as de Méré's paradox.
Chevalier de Méré, a gambler, thought that the two probabilities ought to be
equal and blamed mathematics for his losses.)

20. How many dice have to be thrown to render the probability of no ace less 
than 1/3? 

21. What is the probability that the birthdays of twelve people will fall in 12 
different calendar months? (Assume equal probabilities for the 12 months.) 

22. What is the probability that the birthdays of six people will fall in exactly 
2 calendar months? 

23. Given 30 people, find the probability that among the 12 months there are
6 containing two, and 6 containing three, birthdays.

7 When cells are exposed to harmful radiation, some chromosomes break and 
play the role of our “sticks.” The “long” side is the one containing the so-called 
centromere. If two “long” or two “short” parts unite, the cell dies. Cf. D. G. 
Catcheside, The Effect of X-ray Dosage upon the Frequency of Induced Structural 
Changes in the Chromosomes of Drosophila Melanogaster , Journal of Genetics , vol. 
36 (1938), pp. 307-320. 

8 They are occasionally extraordinarily obliging: cf. J. A. Greenwood and E. E.
Stuart, Review of Dr. Feller's Critique, Journal for Parapsychology, vol. 4 (1940),
pp. 298-319, in particular p. 306.





24. A closet contains n pairs of shoes. If 2r shoes are chosen at random (with 
2r < n), what is the probability that there will be no complete pair among them? 

25. In the preceding problem find the probabilities that among the 2r shoes 
there will be (a) exactly one complete pair, (b) exactly two complete pairs. 

26. A group of $2N$ boys and $2N$ girls is divided into two equal groups. Find the
probability $p$ that each group will be equally divided into boys and girls. Estimate
$p$, using Stirling's formula.

27. Find the probability that a hand of 13 bridge cards contains exactly $k$ red
cards, $k = 0, 1, \ldots, 13$.

28. Find the probability that a hand of 13 bridge cards contains exactly k 
spades. 

29. In bridge, prove that the probability p of West’s receiving exactly k aces is 
the same as the probability that an arbitrary hand of 13 cards contains exactly k 
aces. (This is intuitively clear. Note, however, that the two probabilities refer 
to two different experiments, since in the second case 13 cards are chosen at random 
and in the first case all 52 are distributed.) 

30. The probability that in a bridge game East receives m and South n spades is 
the same as the probability that of two hands of 13 cards each, drawn at random 
from a deck of bridge cards, the first contains m and the second n spades. 

31. What is the probability that the bridge hands of North and South together 
contain exactly k aces, where k = 0, 1, 2, 3, 4? 

32. Let $a$, $b$, $c$, $d$ be four non-negative integers such that $a + b + c + d = 13$.
Find the probability $p(a, b, c, d)$ that in a bridge game the players North, East,
South, West have $a$, $b$, $c$, $d$ spades, respectively.

33. Using the result of problem 32, find the probability that some player receives
$a$, another $b$, a third $c$, and the last $d$, spades if (a) $a = 5$, $b = 4$, $c = 3$, $d = 1$;
(b) $a = b = c = 4$, $d = 1$; (c) $a = b = 4$, $c = 3$, $d = 2$.

(Note that the three cases are essentially different.)

34. Let $a$, $b$, $c$, $d$ be integers with $a + b + c + d = 13$. Find the probability
$q(a, b, c, d)$ that a hand at bridge will consist of $a$ spades, $b$ hearts, $c$ diamonds, and
$d$ clubs.

35. Distribution of aces at bridge. For every possible combination of integers
$a$, $b$, $c$, $d$ with $a + b + c + d = 4$, find the probability that one player will have $a$
aces, another $b$ aces, a third $c$ aces, and the last $d$ aces.

36. Find the probability that each of two hands contains exactly $k$ aces if the
two hands are composed of $r$ bridge cards each, and are drawn (a) from the same
deck, (b) from two decks.

37. Show that when r = 13 the probability in part (a) of problem 36 is the proba¬ 
bility that two preassigned bridge players receive exactly k aces each. 

38. Find the probability for a poker hand to be a⁹

(a) Royal flush (ten, jack, queen, king, ace in a single suit).

(b) Straight flush (any five in sequence in a single suit).

(c) Four of a kind (four cards of equal face values). 

(d) Full house (one pair and one triple of cards with equal face values). 

(e) Flush (five cards in a single suit). 

9 For the definition cf. footnote on p. 9. Note that our definition is a simpli¬ 
fication of the usual one: for example, what we call flush can at the same time be 
a straight flush. In practice such a hand would be classified as straight flush. 
The difference of the probabilities ( e ) minus (b) is the probability of a flush in 
the gambler’s sense of the word. 



(f) Straight (five cards in sequence regardless of the suit).

(g) Three of a kind (three equal face values plus two extra cards). 

(h) Two pairs (two pairs of equal face values plus one other card). 

(i) One pair (one pair of equal face values plus three different cards). 

39. In the example (3.d) the elevator starts with seven passengers and stops at
ten floors. The various arrangements of discharge may be denoted by symbols like
(3, 2, 2), to be interpreted as the event that three passengers leave together at a
certain floor, two other passengers at another floor, and the last two at still another
floor. Find the probabilities of the fifteen possible arrangements ranging from
(7) to (1, 1, 1, 1, 1, 1, 1).

40. Bridge-bingo. A deck of 52 bridge cards is dealt in the usual way to four
players. The cards of the deck are then called in random order, and whichever
player has the card removes it from his hand. The player who is first without cards
wins. Let $p_1(r)$, $p_2(r)$, $p_3(r)$ be the probabilities that, after the $r$th card is called,
one, two, or three preassigned players are without cards. Calculate these
probabilities. Give numerical values.

41. A population of $n$ elements includes $np$ red ones and $nq$ black ones ($p + q = 1$).
A random sample of size $r$ is taken with replacement. Show that the probability
of its including exactly $k$ red elements is

$$\binom{r}{k} p^k q^{r-k}. \tag{*}$$

42. A limit theorem for the hypergeometric distribution. If $n$ is large and $n_1/n = p$,
then the probability $q_k$ given by (5.1) and (5.2) is close to the expression (*) of the
preceding problem. More precisely,

$$\binom{r}{k}\left(p - \frac{k}{n}\right)^k\left(q - \frac{r-k}{n}\right)^{r-k} < q_k < \binom{r}{k} p^k q^{r-k}\left(1 - \frac{r}{n}\right)^{-r}.$$

A comparison of this and the preceding problem shows: for large populations there
is practically no difference between sampling with or without replacement [cf. Chapter
6, example (20)].
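
The statement of problem 42 can be illustrated numerically. The sketch below assumes that the bounds read $\binom{r}{k}(p - k/n)^k(q - (r-k)/n)^{r-k} < q_k < \binom{r}{k}p^kq^{r-k}(1 - r/n)^{-r}$ with $p = n_1/n$ and $q = 1 - p$; the function names and the sample values $r = 10$, $k = 3$, $p = 0.3$ are ours.

```python
from math import comb

def hypergeom(n, n1, r, k):
    """q_k: probability of k red elements in a sample of r drawn without replacement."""
    return comb(n1, k) * comb(n - n1, r - k) / comb(n, r)

def bounds(n, n1, r, k):
    """Assumed lower and upper bounds for q_k, with p = n1 / n."""
    p = n1 / n
    q = 1 - p
    lower = comb(r, k) * (p - k / n) ** k * (q - (r - k) / n) ** (r - k)
    upper = comb(r, k) * p**k * q ** (r - k) * (1 - r / n) ** (-r)
    return lower, upper

# The bounds close in on the binomial value C(10,3) 0.3^3 0.7^7 as n grows.
for n in (100, 1000, 10000):
    n1 = 3 * n // 10
    lo, hi = bounds(n, n1, 10, 3)
    print(n, lo, hypergeom(n, n1, 10, 3), hi)
```

For fixed $r$ and $k$ both bounds tend to the expression (*) as $n$ increases, which is the content of the limit theorem.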


9. Problems for Solution: Binomial Coefficients and Stirling’s Formula 

Note: All following identities are of interest and frequently used. 

1. For any a > 0 




Prove this directly and also by differentiation of the geometric series 
(1 - x)~\ 


2. Prove that 
(9.2) 



= 


3. Prove that 


wG)G)-G)C:!) + G)C:D--OCoV- 





4. For integral $n \ge 2$

$$\binom{n}{1} + 2\binom{n}{2} + 3\binom{n}{3} + \cdots = n2^{n-1},$$

$$\binom{n}{1} - 2\binom{n}{2} + 3\binom{n}{3} - + \cdots = 0, \tag{9.4}$$

$$2\cdot 1\binom{n}{2} + 3\cdot 2\binom{n}{3} + 4\cdot 3\binom{n}{4} + \cdots = n(n-1)2^{n-2}.$$

(Hint: Use the binomial formula.)

5. In section 5 we remarked that the terms of the hypergeometric distribution
should add to unity. This amounts to saying that for any positive integers $a$, $b$, $n$,

$$\binom{a}{0}\binom{b}{n} + \binom{a}{1}\binom{b}{n-1} + \cdots + \binom{a}{n}\binom{b}{0} = \binom{a+b}{n}. \tag{9.5}$$

Prove this by induction. [Hint: Prove first that (9.5) holds for $a = 1$ and all $b$.]

6. Continuation. By a comparison of the coefficients of $t^n$ on both sides of

$$(1+t)^a(1+t)^b = (1+t)^{a+b} \tag{9.6}$$

prove more generally that (9.5) is true for arbitrary numbers $a$, $b$ (and integral $n$).

7. Using (9.5), prove that 




for all integral $a$, $b$, and $r$.

8. Using (9.5), prove that 




9. Using (9.8), prove that

$$\sum_{r=0}^{n} \frac{(2n)!}{(r!)^2\,((n-r)!)^2} = \binom{2n}{n}^2. \tag{9.9}$$

10. Prove the identity for integers 0 < a < b 




[Hint: Use first (9.1), then (9.5), then again (9.1). Alternatively, compare
coefficients of $t^{a-1}$ in $(1-t)^a(1-t)^{-b-2} = (1-t)^{a-b-2}$.]

11. Using (9.5), prove that 

<»■”> scxno+n-rT- 1 )- 

[Hint: Apply (9.1) back and forth.] 

12. Prove that for 0 < k < n 

Q-cxxx-xh'-er)- 


( 9 . 12 ) 



[Hint: Compare the coefficients of $t^k$ on either side of $(1+t)^n(1+t)^{-1} =
(1+t)^{n-1}$. Cf. problem 6.]

13. From (9.12) deduce 


(9.13) 


c)-c;,)+c;,)--=(::!)■ 


14. Prove by induction on a that for integers a > r > 0 

;?:cn 

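15. Prove by induction

$$\binom{n}{1}\frac{1}{1} - \binom{n}{2}\frac{1}{2} + \binom{n}{3}\frac{1}{3} - + \cdots + (-1)^{n-1}\binom{n}{n}\frac{1}{n} = 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n}. \tag{9.15}$$

[Hint: Remember (6.11).]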

16. Show that for any positive integer $m$

$$(x+y+z)^m = \sum \frac{m!}{a!\,b!\,c!}\, x^a y^b z^c, \tag{9.16}$$

where the summation extends over all non-negative integers $a$, $b$, $c$, such that
$a + b + c = m$.

17. Using Stirling's formula (7.7), prove that

$$\binom{2n}{n} \sim (\pi n)^{-1/2}\, 2^{2n}. \tag{9.17}$$

18. Prove that for any positive integers $a$ and $b$

$$\frac{(a+1)(a+2)\cdots(a+n)}{(b+1)(b+2)\cdots(b+n)} \sim \frac{b!}{a!}\, n^{a-b}. \tag{9.18}$$

19. The gamma function is defined by

$$\Gamma(x) = \int_0^\infty z^{x-1}e^{-z}\,dz, \tag{9.19}$$

where $x > 0$. Show that $\Gamma(x) \sim (2\pi)^{1/2} e^{-x} x^{x-1/2}$. [Note that if $x = n$ is an integer,
$\Gamma(n) = (n-1)!$.]

20. Let $a$ and $r$ be arbitrary positive numbers and $n$ a positive integer. Show
that

$$a(a+r)(a+2r)\cdots(a+nr) \sim C r^{n+1} n^{n+a/r+1/2} e^{-n}. \tag{9.20}$$

[The constant $C$ is equal to $(2\pi)^{1/2}/\Gamma(a/r)$.]

21. Using the results of the preceding problem, show that

$$\frac{a(a+r)(a+2r)\cdots(a+nr)}{b(b+r)(b+2r)\cdots(b+nr)} \sim \frac{\Gamma(b/r)}{\Gamma(a/r)}\, n^{(a-b)/r}. \tag{9.21}$$

(9.21) 



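22. Prove the following alternative form of Stirling's formula:

$$n! \sim (2\pi)^{1/2}\left(n + \tfrac{1}{2}\right)^{n+1/2} e^{-(n+1/2)}. \tag{9.22}$$

23. Continuation. Using the method of the text, show that

$$(2\pi)^{1/2}\left(n + \tfrac{1}{2}\right)^{n+1/2} e^{-(n+1/2) - 1/(24(n+1/2))} < n! < (2\pi)^{1/2}\left(n + \tfrac{1}{2}\right)^{n+1/2} e^{-(n+1/2)}. \tag{9.23}$$

24. Extending Stirling's formula, prove that

$$n! \sim (2\pi)^{1/2}\, n^{n+1/2} \exp\left\{-n + \frac{1}{12n} - \frac{1}{360n^3} + \cdots\right\}. \tag{9.24}$$

25. Prove similarly

$$n! \sim (2\pi)^{1/2}\left(n + \tfrac{1}{2}\right)^{n+1/2} \exp\left\{-\left(n + \tfrac{1}{2}\right) - \frac{1}{24\left(n + \frac{1}{2}\right)} + \frac{7}{2880\left(n + \frac{1}{2}\right)^3} + \cdots\right\}. \tag{9.25}$$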
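
Several of these asymptotic relations can be sanity-checked numerically before a proof is attempted. The sketch below assumes that (9.17) reads $\binom{2n}{n} \sim (\pi n)^{-1/2}2^{2n}$ and that problem 23 asserts the two-sided bound with the extra exponent $-1/(24(n+\frac12))$ on the left; the function names are ours, and $\log n!$ is taken from the log-gamma function.

```python
from math import comb, lgamma, log, pi, sqrt

def central_ratio(n):
    """C(2n, n) divided by the assumed asymptotic value (pi*n)^(-1/2) * 2^(2n)."""
    return comb(2 * n, n) / 4**n * sqrt(pi * n)

def log_alt_stirling(n, correction=False):
    """log of (2*pi)^(1/2) (n + 1/2)^(n + 1/2) e^(-(n + 1/2)); with correction=True
    the exponent is lowered by 1/(24(n + 1/2))."""
    m = n + 0.5
    value = 0.5 * log(2 * pi) + m * log(m) - m
    if correction:
        value -= 1 / (24 * m)
    return value

for n in (1, 10, 100, 1000):
    print(n, central_ratio(n),
          log_alt_stirling(n, True) - lgamma(n + 1),   # negative: lower bound
          log_alt_stirling(n) - lgamma(n + 1))         # positive: upper bound
```

A ratio close to 1 and differences of the expected signs are of course only evidence, not a proof.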



* CHAPTER 3 


THE SIMPLEST OCCUPANCY AND ORDERING PROBLEMS 

This chapter represents a digression from the main path of the book, 
and a knowledge of its results is not required in the sequel. We shall 
treat a few special problems, partly for their intrinsic interest, and 
partly for their importance in applications. Among these, the Bose 
and Fermi statistics provide an instructive illustration of the physicist's 
use of probability models. 

1. Combinatorial Lemmas 

We want to study random distributions of $r$ objects in $n$ cells or
compartments, with no restrictions imposed on the number of objects
in any particular cell. Each object can be placed in $n$ different ways,
and therefore (cf. Chapter 2, section 2) the number of different
distributions is $n^r$. However, if the objects are indistinguishable, then an
exchange of two objects is not observable, and the number of
distinguishable distributions is smaller than $n^r$.

Examples. (a) With two objects and two cells we have four
arrangements which may conveniently be represented by (AB | -), (A | B),
(B | A), and (- | AB). However, with two A's we have only the
three distinguishable arrangements (AA | -), (A | A), and (- | AA),
which are preferably represented by (2 | 0), (1 | 1), and (0 | 2). For
three indistinguishable things and two cells we have the four
distributions (3 | 0), (2 | 1), (1 | 2), (0 | 3). The first and the last correspond
to unique arrangements of three distinct things, while (2 | 1), and
similarly (1 | 2), may stand for any of the three arrangements (AB | C),
(AC | B), (BC | A).

(b) The six faces of a die may be interpreted as cells; a throw with $r$
dice then puts each die into one of the six cells. There are $6^r$ different
arrangements, and they could be made distinguishable, for example, by
using dice of different colors or size. Usually, however, we are unable
(or unwilling) to identify the individual dice. With 6 dice the event
“no two faces alike” can occur in 6! = 720 different ways, but they are
considered indistinguishable.

* Starred chapters treat special topics and may be omitted at first reading. 



(c) Situations of a similar nature are common. For example, the
days of the year correspond to cells, and people's birthdays to things
placed into the cells. The birthdays of $r$ people can be distributed in
$365^r$ ways [cf. Chapter 2, example (3.c)]. For purposes of birthday
statistics it does not matter whether Peter or Paul has a birthday, and
we agree to treat birthdays as indistinguishable. Again, in the elevator
example [(3.d) of Chapter 2] the floors are cells into which passengers
are placed, and it is natural not to distinguish between passengers.
When a boy collects coupons found in cereal packages, the different
kinds of coupons correspond to cells. The collector sees only the
number of coupons of each kind, which means that he does not distinguish
between all $n^r$ arrangements.


To find the number of distinguishable arrangements of $r$
indistinguishable objects in $n$ cells, we shall represent the cells
by the spaces between $n + 1$ bars and the objects by A's. Thus
|AAA|A||||AAAA|| is used to indicate that there are seven cells
in all and that they contain 3, 1, 0, 0, 0, 4, 0 things, respectively. The
arrangement always starts and ends with a bar, but we may put the
remaining $n - 1$ bars and $r$ letters in an arbitrary order. The
arrangement is fixed by selecting the $r$ places in which the letters stand.
Hence there are as many distinguishable arrangements as there are
ways of selecting $r$ places out of $n + r - 1$, namely

$$\binom{n+r-1}{r}.$$





We have thus proved 


Lemma 1. There are $n^r$ different ways of putting $r$ objects into $n$ cells.
With indistinguishable things the number of distinguishable arrangements
is

$$\binom{n+r-1}{r} = \binom{n+r-1}{n-1}.$$

Examples. (d) A throw with $r$ dice can result in $\binom{r+5}{5}$
distinguishable arrangements. For a throw with $r$ coins there are $r + 1$
distinguishable results; in fact, the number of heads is $0, 1, \ldots, r$.

(e) Partial derivatives. Consider an analytic function of $n$ variables
$f(x_1, \ldots, x_n)$. Formally we can calculate its partial derivatives of
order $r$ in $n^r$ ways. However, the order of differentiations plays no
role, and it matters only how often each variable appears. We have
one cell corresponding to each variable, and our lemma shows that
there exist $\binom{n+r-1}{r}$ different partial derivatives of $r$th order. A
function of three variables has 15 derivatives of fourth and 21
derivatives of fifth order.
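
Lemma 1 can be confirmed by brute force for small $r$ and $n$. The sketch below lists all $n^r$ placements of distinguishable objects, reduces each to its vector of occupancy numbers, and counts the distinct vectors; the function name `occupancy_vectors` is ours.

```python
from itertools import product
from math import comb

def occupancy_vectors(r, n):
    """Set of distinguishable arrangements of r like objects in n cells."""
    found = set()
    for cells in product(range(n), repeat=r):        # all n**r labelled placements
        found.add(tuple(cells.count(c) for c in range(n)))
    return found

print(sorted(occupancy_vectors(3, 2)))   # (0, 3), (1, 2), (2, 1), (3, 0)
print(len(occupancy_vectors(4, 3)), comb(3 + 4 - 1, 4))   # fourth-order derivatives, three variables
print(comb(3 + 5 - 1, 5))                # fifth-order derivatives, three variables
```

The enumeration grows like $n^r$ and is meant only as a check; the lemma gives the count directly.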


Lemma 2. Let $r \ge n$. The number of ways in which $r$ indistinguishable
things can be put into $n$ cells, none of which is to be empty, is

$$\binom{r-1}{n-1}.$$



Proof. In the arrangement of letters and bars described above it 
is now required that at most one bar appears between any two letters. 
There are r — 1 spaces between the letters, and we can choose any 
n — 1 among them as places for bars. 
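
The two lemmas can be checked by direct enumeration for small n and r. The following is a minimal sketch in Python, not part of the original text; the function name `occupancy_patterns` is ours. It lists all $n^r$ placements of labelled objects and keeps only the distinct sets of occupancy numbers.

```python
from itertools import product
from math import comb

def occupancy_patterns(n, r):
    """Distinct occupancy-number tuples arising from all n**r placements
    of r labelled objects into n cells."""
    patterns = set()
    for placement in product(range(n), repeat=r):
        # counts[c] = number of objects that landed in cell c
        counts = tuple(placement.count(cell) for cell in range(n))
        patterns.add(counts)
    return patterns

n, r = 3, 4
patterns = occupancy_patterns(n, r)
no_empty = [p for p in patterns if min(p) > 0]

# Lemma 1: number of distinguishable arrangements
print(len(patterns), comb(n + r - 1, r))
# Lemma 2: arrangements with no empty cell
print(len(no_empty), comb(r - 1, n - 1))
```

For n = 3 and r = 4 this agrees with the count of 15 fourth-order derivatives of a function of three variables quoted in example (e).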


2. Bose-Einstein and Fermi-Dirac Statistics 

In the examples of the preceding section it was natural to assume that all $n^r$ arrangements have equal probabilities. We shall now consider cases where facts and experience have compelled physicists to abandon this hypothesis and to assign probabilities in different ways.

Consider a mechanical system of r indistinguishable particles. In statistical mechanics it is usual to subdivide the phase space into a large number, n, of small regions or cells so that each particle is assigned to one cell. In this way the state of the entire system is described in terms of a random distribution of the r particles in n cells. Offhand it would seem that (at least with an appropriate definition of the n cells) all $n^r$ arrangements should have equal probabilities. If this is true, the physicist speaks of Maxwell-Boltzmann statistics (the term “statistics” is here used in a sense peculiar to physics). Numerous attempts have been made to prove that physical particles behave in accordance with Maxwell-Boltzmann statistics, but modern theory has shown beyond doubt that this statistics does not apply to any known particles; in no case are all $n^r$ arrangements approximately equally probable. Two
different probability models have been introduced, and each describes 
satisfactorily the behavior of one type of particle. The justification 
of either model depends on its success. Neither claims universality, 
and it is possible that some day a third model may be introduced for 
certain kinds of particles. 

Remember that we are here concerned only with indistinguishable particles. We have r particles and n cells. By Bose-Einstein statistics we mean that only distinguishable arrangements are considered and that each is assigned probability $1/\binom{n+r-1}{r}$. It is shown in statistical mechanics that this assumption holds true for photons, nuclei, and atoms containing an even number of elementary particles.¹ To describe
other particles a third possible assignment of probabilities must be 
introduced. By Fermi-Dirac statistics we understand these hypotheses: (1) it is impossible for two or more particles to be in the same cell, and (2) all distinguishable arrangements satisfying the first condition have equal probabilities. The first hypothesis requires that $r \le n$. An arrangement is then completely described by stating which of the n cells contain a particle, and since there are r particles the corresponding cells can be chosen in $\binom{n}{r}$ ways. Hence, with Fermi-Dirac statistics there are in all $\binom{n}{r}$ possible arrangements, each having probability $\binom{n}{r}^{-1}$.


This model applies to electrons, neutrons, and protons. We have 
here an instructive example of the impossibility of selecting or justify¬ 
ing probability models by a priori arguments. In fact, no pure reason¬ 
ing could tell that photons and protons would not obey the same 
probability laws. (Essential differences between Maxwell-Boltzmann 
and Bose-Einstein statistics are discussed in problems 6-10. Cf., in 
particular, problems 8 and 9.) 


Example. Let n = 5, r = 3. The arrangement (A | - | A | A | -) has probability 6/125, 1/35, or 1/10, according to whether Maxwell-Boltzmann, Bose-Einstein, or Fermi-Dirac statistics is used.
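
The three values in this example can be recomputed directly. The sketch below is our own illustration in Python (the function name and the tuple representation of an arrangement by its occupancy numbers are assumptions, not notation from the text). Under Maxwell-Boltzmann statistics the arrangement is counted once for each labelling of the particles that produces it.

```python
from fractions import Fraction
from math import comb, factorial

def arrangement_probabilities(occupancy):
    """Probability of one distinguishable arrangement, given as a tuple of
    occupancy numbers, under the three statistics of this section."""
    n, r = len(occupancy), sum(occupancy)

    # Maxwell-Boltzmann: r!/(k_1! ... k_n!) labelled placements out of n**r
    ways = factorial(r)
    for k in occupancy:
        ways //= factorial(k)
    maxwell_boltzmann = Fraction(ways, n ** r)

    # Bose-Einstein: every distinguishable arrangement is equally likely
    bose_einstein = Fraction(1, comb(n + r - 1, r))

    # Fermi-Dirac: at most one particle per cell, all such arrangements equal
    if max(occupancy) <= 1:
        fermi_dirac = Fraction(1, comb(n, r))
    else:
        fermi_dirac = Fraction(0)
    return maxwell_boltzmann, bose_einstein, fermi_dirac

print(arrangement_probabilities((1, 0, 1, 1, 0)))
```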


3. The Classical Occupancy Problem 

We return to the random distribution of r objects in n cells and assume again that each of the $n^r$ possible arrangements has probability $n^{-r}$. The probability that a specified cell contains exactly k objects is then

(3.1) $p_k = \binom{r}{k} \frac{(n-1)^{r-k}}{n^r}$

(if k > r the binomial coefficient vanishes so that $p_k = 0$, as is proper). To prove (3.1), it suffices to note that the k objects can be chosen in $\binom{r}{k}$ ways, and the remaining r - k objects can be placed into the remaining n - 1 cells in $(n-1)^{r-k}$ ways.

Example. A sequence of 100 random digits represents a distribution 
of 100 things into ten cells. Accordingly, the probability pk that 
the digit 7 appears among 100 random digits exactly k times is given 

1 Cf. H. Margenau and G. M. Murphy, The Mathematics of Physics and Chem¬ 
istry , New York, 1943, Chapter 12. 






by (3.1) with r = 100, n = 10. Table 1 of Chapter 2 gives the actual 
counts of the occurrences of the digit 7 in 100 sequences of 100 digits 
each. Interpreting probabilities as long-run frequencies, we should 
expect the number of sequences in which the 7 occurs exactly k times 
to be approximately $100 p_k$. Table 1 compares the theory with actual
counts. Presumably doubts will arise in the reader’s mind as to 
whether the counts confirm the theory. The theory of the chi-square 
test provides objective means of judging the closeness of observed 
frequencies to theoretical probabilities. It turns out that under ideal 
circumstances roughly in two out of three cases chance fluctuations 
would produce deviations larger than those exhibited in Table 1. 


TABLE 1

Occurrence of 7 among 100 Random Digits

   k    100 p_k    N_k          k    100 p_k    N_k
   0      0.003      0         11     11.988      9
   1      0.030      0         12      9.879     10
   2      0.162      0         13      7.430      2
   3      0.589      1         14      5.130      5
   4      1.587      1         15      3.268      3
   5      3.389      6         16      1.929      3
   6      5.958      6         17      1.059      2
   7      8.890     11         18      0.543      0
   8     11.482     18         19      0.260      0
   9     13.042      8         20      0.117      1
  10     13.186     14

$p_k$ is the probability of exactly k occurrences, and $N_k$ the observed number of occurrences in the 100 batches of 100 digits each recorded in Table 1 of Chapter 2.
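
The theoretical column of Table 1 follows from (3.1) with r = 100 and n = 10. A short Python sketch (ours, with the hypothetical helper name `p`) that recomputes it; the observed counts $N_k$ are data and cannot be recomputed.

```python
from math import comb

def p(k, r=100, n=10):
    """Formula (3.1): probability that a specified cell holds exactly k
    of r objects distributed at random among n cells."""
    return comb(r, k) * (n - 1) ** (r - k) / n ** r

for k in range(21):
    print(f"{k:2d}  {100 * p(k):7.3f}")
```

The printed values may differ from the table in the last digit, since the rounding convention used for the table is not stated.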


Formula (3.1) is a special case of the so-called binomial distribution 
which will be taken up in Chapter 6, where we shall examine various 
properties of (3.1) and see how it can be evaluated when r and n are 
large. Note that (3.1) does not solve all problems of occupancy. For 
example, the theory thus far developed does not permit the computa¬ 
tion of the probability that k cells will be empty. Tools appropriate 
for such problems will be developed in section 3 of the next 
chapter. 

For Bose-Einstein statistics the probability $q_k$ of a k-fold occupancy of any specified cell is given by a formula analogous to (3.1). It is worth noting here that the two sequences $p_k$ and $q_k$ exhibit completely different characters (cf. problems 7-9).





4. Runs 

In any ordered sequence of elements of two kinds, each maximal subsequence of elements of like kind is called a run. For example, the sequence AAABAABBBA opens with an A-run of length 3; it is followed by runs of length 1, 2, 3, 1, respectively. The A- and B-runs alternate so that the total number of runs is always one plus the number of unlike neighbors in the given sequence.
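
The definition translates directly into a few lines of code. The following Python sketch is our own illustration (the function names are hypothetical); it splits a sequence into its runs and confirms the remark about unlike neighbors.

```python
from itertools import groupby

def runs(sequence):
    """List of (symbol, length) pairs, one per maximal run."""
    return [(symbol, len(list(group))) for symbol, group in groupby(sequence)]

def unlike_neighbors(sequence):
    """Number of adjacent pairs consisting of two different symbols."""
    return sum(1 for a, b in zip(sequence, sequence[1:]) if a != b)

seq = "AAABAABBBA"
print(runs(seq))
print(len(runs(seq)), 1 + unlike_neighbors(seq))
```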

Examples. (a) An observation² yielded the following arrangement of empty and occupied seats along a lunch counter: EOEEOEEEOEEEOEOE. Note that no two occupied seats are adjacent. Can this be
due to chance? With five occupied and eleven empty seats it is im¬ 
possible to get more than eleven runs, and this number was actually 
observed. It will be shown later that if all arrangements were equally 
probable the probability of eleven runs would be 0.0577 approximately. This
small probability to some extent confirms the hunch that the separa¬ 
tions observed were intentional. This suspicion cannot be proved by 
statistical methods, but further evidence could be collected from con¬ 
tinued observation. If the lunch counter were frequented by families, 
there would be a tendency for occupants to cluster together, and this 
would lead to relatively small numbers of runs. Similarly, counting 
runs of boys and girls in a classroom might disclose the mixing to be 
better or worse than random. Improbable arrangements give clues to 
assignable causes; an excess of runs points to intentional mixing, a paucity of runs to intentional clustering. It is true that these conclusions are
never foolproof. Even with perfect randomness improbable situations 
occur and may mislead us into a search for assignable causes. However, 
this will be a rarity, and with an appropriate criterion we shall in actual 
practice be misled once in 100 times but find assignable causes 99 out 
of 100 times. 

Examples for statistical applications of the theory of runs occur in 
industrial quality control as introduced by Shewhart. As washers are 
produced, they will vary in thickness. Long runs of thick washers 
may suggest imperfections in the production process and lead to the 
removal of the causes; thus oncoming trouble may be forestalled and 
greater homogeneity of product achieved. 

In biological field experiments one counts successions of healthy and 
diseased plants, and long runs are suggestive of contagion. The 

2 F. S. Swed and C. Eisenhart, Tables for Testing Randomness of Grouping in 
a Sequence of Alternatives, Annals of Mathematical Statistics , vol. 14 (1943), pp. 
66-87. 





meteorologist watches successions of dry and wet months³ to discover
clues to a tendency of the weather to persist. 

(b) In physics, the theory of runs is used in the study of cooperative 
phenomena. In Ising’s theory of one-dimensional lattices the energy 
depends on the number of unlike neighbors, that is, the number of runs. 
In a more refined theory the lengths of the runs also play a role. Here, 
as in agricultural experiments, it is desirable to generalize the theory 
of runs to multidimensional arrangements. At present only one¬ 
dimensional arrangements have been investigated in detail. 


Many questions relative to runs⁴ can be asked, but we shall prove here only the following

Theorem. Suppose that all arrangements of $r_1$ elements of one kind and $r_2$ elements of a second kind have equal probabilities, and let $P_k$ be the probability that an arrangement contains exactly k runs. Then, for an even number $k = 2\nu$

(4.1) $P_{2\nu} = 2 \binom{r_1-1}{\nu-1} \binom{r_2-1}{\nu-1} \div \binom{r_1+r_2}{r_1}$

while for $k = 2\nu + 1$

(4.2) $P_{2\nu+1} = \left\{ \binom{r_1-1}{\nu} \binom{r_2-1}{\nu-1} + \binom{r_1-1}{\nu-1} \binom{r_2-1}{\nu} \right\} \div \binom{r_1+r_2}{r_1}$

(cf. problems 14-16).

Proof. The $r = r_1 + r_2$ elements can be arranged in r! ways. However, any permutation among the $r_1$ elements of the first kind, or among the $r_2$ elements of the second kind, will leave the outer appearance unchanged. Hence there exist

(4.3) $\frac{r!}{r_1!\, r_2!} = \binom{r_1+r_2}{r_1} = \binom{r_1+r_2}{r_2}$

distinguishable orderings, and each represents $r_1!\, r_2!$ different arrangements of the $r_1 + r_2$ elements. It follows that all distinguishable orderings have equal probabilities, and the denominator in either (4.1)

3 W. G. Cochran, An Extension of Gold’s Method of Examining the Apparent 
Persistence of One Type of Weather, Quarterly Journal of the Royal Meteorological 
Society , vol. 64, No. 277 (1938), pp. 631-634. 

4 For further results and literature see S. S. Wilks, Mathematical Statistics , Prince¬ 
ton, 1943, Chapter 10. 





or (4.2) is the number of distinguishable orderings. We have to prove that the numerators represent the numbers of distinguishable arrangements with $2\nu$ or $2\nu + 1$ runs, respectively. Consider first the case $k = 2\nu$. Then we have $\nu$ runs of the first kind and $\nu$ runs of the second kind. Each run represents a cell. By lemma 2 of section 1 the $r_1$ elements of the first kind can be distributed in $\binom{r_1-1}{\nu-1}$ ways into $\nu$ cells none of which is empty. Similarly, the elements of the second kind can be distributed in $\binom{r_2-1}{\nu-1}$ distinguishable ways. Finally, the runs alternate but a run of either kind may be first. This accounts for the numerator in (4.1). If the number of runs is odd, say, $2\nu + 1$, then there are $\nu + 1$ runs of one kind and $\nu$ runs of the other, and the same method of counting shows the numerator in (4.2) to be the number of distinguishable orderings with $2\nu + 1$ runs.

In the paper by Swed and Eisenhart quoted above, the probabilities $P_k$ are tabulated for all $r_1$ and $r_2$ up to 20.


Examples. If $r_1 = r_2 = 2$, an arrangement may consist of 2, 3, or 4 runs and these possibilities have equal probabilities.

For $r_1 = 2$, $r_2 = 3$ the number of runs equals 2, 3, 4, or 5 with probabilities 0.2, 0.3, 0.4, 0.1, respectively.

For $r_1 = r_2 = 3$, the probabilities of 2, 3, 4, 5, 6 runs are 0.1, 0.2, 0.4, 0.2, 0.1.

In the lunch-counter example (4.a) we have $r_1 = 5$ and $r_2 = 11$. The probability of 11 runs is

$P_{11} = \binom{4}{4} \binom{10}{5} \div \binom{16}{5} = \frac{3}{52}$.

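
The theorem on runs and the numerical examples of this section can be verified mechanically. The Python sketch below is ours and only illustrative: `run_probability` evaluates (4.1) and (4.2) in exact arithmetic, while `run_distribution_by_enumeration` lists every distinguishable ordering and counts its runs. Both function names are assumptions.

```python
from fractions import Fraction
from itertools import combinations, groupby
from math import comb

def run_probability(k, r1, r2):
    """P_k from (4.1) for even k and (4.2) for odd k; assumes r1, r2 >= 1."""
    v, odd = divmod(k, 2)
    if v < 1:
        return Fraction(0)          # fewer than 2 runs is impossible
    total = comb(r1 + r2, r1)       # number of distinguishable orderings
    if odd:
        numerator = (comb(r1 - 1, v) * comb(r2 - 1, v - 1)
                     + comb(r1 - 1, v - 1) * comb(r2 - 1, v))
    else:
        numerator = 2 * comb(r1 - 1, v - 1) * comb(r2 - 1, v - 1)
    return Fraction(numerator, total)

def run_distribution_by_enumeration(r1, r2):
    """Exact distribution of the number of runs, by listing all orderings."""
    counts = {}
    n = r1 + r2
    for positions in combinations(range(n), r1):
        chosen = set(positions)
        ordering = ["A" if i in chosen else "B" for i in range(n)]
        k = sum(1 for _ in groupby(ordering))
        counts[k] = counts.get(k, 0) + 1
    total = comb(n, r1)
    return {k: Fraction(c, total) for k, c in counts.items()}

print([run_probability(k, 2, 3) for k in range(2, 6)])
print(run_probability(11, 5, 11))   # lunch-counter example
```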


5. Problems for Solution 

1. If $r_1$ indistinguishable things of one kind and $r_2$ indistinguishable things of a second kind are placed into n cells, find the number of distinguishable arrangements.

2. If $r_1$ dice and $r_2$ coins are thrown, how many results can be distinguished?

3. In how many different distinguishable ways can $r_1$ white, $r_2$ black, and $r_3$ red balls be arranged?

In problems 4-6 let n be the number of cells and r the number of objects; assume that the objects are distinguishable, and that all $n^r$ distributions are equally probable.

4. When r = n, find the probabilities that (a) no cell, (b) only one cell remains empty.

5. Find the probability that the first cell contains $k_1$ things, the second $k_2$ things, etc., where $k_1 + \cdots + k_n = r$.

6. The most probable number of things in any given cell is the integer $\nu$ with $(r - n + 1)/n < \nu \le (r + 1)/n$. More precisely the probabilities (3.1) satisfy the relation $p_0 < p_1 < \cdots < p_{\nu-1} \le p_\nu > p_{\nu+1} > \cdots > p_r$ (cf. problem 8).





In problems 7-10 r and n have the same meaning as above, but we assume that the objects are indistinguishable and that all distinguishable arrangements have equal probabilities (Bose-Einstein statistics).

7. The probability that a given cell contains exactly k things is

(5.1) $q_k = \binom{n+r-k-2}{r-k} \div \binom{n+r-1}{r}$.

8. Show that when n > 2 zero is the most probable number of things in any cell, or more precisely, $q_0 > q_1 > q_2 > \cdots$ (cf. problem 6).

9. Limit theorem. Let $n \to \infty$ and $r \to \infty$, so that the average number of particles per cell, r/n, tends to $\lambda$. Then

(5.2) $q_k \to \frac{\lambda^k}{(1+\lambda)^{k+1}}$.

[The right side is known as the geometric distribution. The corresponding limiting form for (3.1) is the Poisson distribution; cf. Chapter 6.]

10. The probability that exactly m cells remain empty is

(5.3) $\rho_m = \binom{n}{m} \binom{r-1}{n-m-1} \div \binom{n+r-1}{r}$.

(Cf. problem 4. A similar general formula for the case where all $n^r$ arrangements are equally probable will be given in Chapter 4, section 5.)

11. From the meanings of the probabilities $p_k$, $q_k$, and $\rho_m$ [cf. (3.1), (5.1), and (5.3)] it follows that $\sum p_k = \sum q_k = \sum \rho_k = 1$. Prove this algebraically from identities which were given in Chapter 2 (either in the text or as problems in section 9).


Further theorems on runs: In the following problems we consider arrangements of $r_1$ alphas and $r_2$ betas and assume that all arrangements are equally probable.

12. The probability that the arrangement starts with an alpha run of length $\nu$ is

$\frac{(r_1)_\nu \, r_2}{(r_1 + r_2)_{\nu+1}} = \binom{r_1+r_2-\nu-1}{r_1-\nu} \div \binom{r_1+r_2}{r_1}$.

13. The probability that the arrangement starts with a beta is $r_2/(r_1 + r_2)$.

14. From the theorem of section 4 deduce that the most probable number of runs is an integer k with

$\frac{2 r_1 r_2}{r_1 + r_2} < k < \frac{2 r_1 r_2}{r_1 + r_2} + 3$.

15. The probability of having exactly k runs of alphas is

$\binom{r_1-1}{k-1} \binom{r_2+1}{k} \div \binom{r_1+r_2}{r_1}$.

(Hint: Use the theorem of section 4 and note that the betas are arranged in k - 1, k, or k + 1 runs. Cf. Chapter 7, problem 11, and Chapter 9, problem 12.)

16. The probability for the alphas to be arranged in k runs of which $k_1$ are of length 1, $k_2$ of length 2, ..., $k_\nu$ of length $\nu$ (with $k_1 + \cdots + k_\nu = k$) is

$\frac{k!}{k_1!\, k_2! \cdots k_\nu!} \binom{r_2+1}{k} \div \binom{r_1+r_2}{r_1}$.


CHAPTER 4 


COMBINATION OF EVENTS 

This chapter is concerned with events A which are defined in terms of certain other events $A_1, A_2, \ldots, A_N$. Thus in a game of bridge the event A, “at least one player has a complete suit,” is the union of the four events $A_k$, “player number k has a complete suit” (k = 1, 2, 3, 4). Of the events $A_k$ one, two, or more can occur simultaneously, and, because of this overlap, the probability of A is not the sum of the four probabilities $\Pr\{A_k\}$. Given a set of events $A_1, \ldots, A_N$, we shall show how to compute the probabilities that 0, 1, 2, 3, ... among them occur. The formulas are useful for certain applications and are also of theoretical interest. However, in this book the formulas will not be used explicitly, and the present chapter can therefore be omitted at a first reading. As a compromise we suggest studying only section 1.

The material of this chapter and a variety of applications are covered in a recent monograph by M. Fréchet,¹ to which the reader is referred for further information.

1. Union of Events 

If $A_1$ and $A_2$ are two events, then $A = A_1 \cup A_2$ denotes the event that either $A_1$ or $A_2$ or both occur. By formula (6.4) of Chapter 1 we have

(1.1) $\Pr\{A\} = \Pr\{A_1\} + \Pr\{A_2\} - \Pr\{A_1 A_2\}$.

We want to generalize this formula to the case of N events $A_1, A_2, \ldots, A_N$; that is, we wish to compute the probability of the event that at least one among the $A_k$ occurs. In symbols this event is $A = A_1 \cup A_2 \cup \cdots \cup A_N$. For our purpose it is not sufficient to know the probabilities of the individual events $A_k$, but we must be given complete information concerning all possible overlaps. This means that for every pair (i, j), every triple (i, j, k), etc., we must know the probability of $A_i$ and $A_j$, or $A_i$, $A_j$, and $A_k$, etc., occurring

1 Les probabilités associées à un système d'événements compatibles et dépendants, Actualités scientifiques et industrielles, nos. 859 and 942, Paris, 1940 and 1943.



simultaneously. For convenience of notation we shall denote these probabilities by the letter p with appropriate subscripts. Thus

(1.2) $p_i = \Pr\{A_i\}, \quad p_{ij} = \Pr\{A_i A_j\}, \quad p_{ijk} = \Pr\{A_i A_j A_k\}, \ldots$

The order of the subscripts is irrelevant, but for uniqueness we shall always write the subscripts in increasing order; thus, we write $p_{3,7,11}$ and not $p_{7,3,11}$. Two subscripts are never equal. For the sum of all p's with r subscripts we shall write $S_r$, that is, we define

(1.3) $S_1 = \sum p_i, \quad S_2 = \sum p_{ij}, \quad S_3 = \sum p_{ijk}, \ldots$

Here $i < j < k < \cdots \le N$, so that in the sums each combination appears once and only once; hence $S_r$ has $\binom{N}{r}$ terms. The last sum, $S_N$, reduces to the single term $p_{1,2,3,\ldots,N}$, which is the probability of the simultaneous realization of all N events. For N = 2 we have only the two terms $S_1$ and $S_2$, and formula (1.1) can be written

(1.4) $\Pr\{A\} = S_1 - S_2$.


The generalization to an arbitrary number N of events is given in the following

Theorem. The probability $P_1$ of the realization of at least one among the events $A_1, A_2, \ldots, A_N$ is given by

(1.5) $P_1 = S_1 - S_2 + S_3 - S_4 + - \cdots \pm S_N$.


Proof. We prove (1.5) by the so-called method of inclusion and exclusion. To compute $P_1$ we should add the probabilities of all sample points which are contained in at least one of the $A_i$, but each point should be taken only once. To proceed systematically we first take the points which are contained in only one $A_i$, then those contained in exactly two events, and so forth, and finally the points (if any) contained in all $A_i$. Now let E be any sample point contained in exactly n among our N events $A_i$. Without loss of generality we may number the events so that E is contained in $A_1, A_2, \ldots, A_n$ but not contained in $A_{n+1}, A_{n+2}, \ldots, A_N$. Then $\Pr\{E\}$ appears as a contribution to those $p_i, p_{ij}, p_{ijk}, \ldots$ whose subscripts range from 1 to n. Hence $\Pr\{E\}$ appears n times as a contribution to $S_1$, and $\binom{n}{2}$ times as a contribution to $S_2$, etc. In all, when the right-hand side of (1.5) is expressed in terms of the probabilities of sample points we find $\Pr\{E\}$ with the factor

(1.6) $n - \binom{n}{2} + \binom{n}{3} - + \cdots \pm \binom{n}{n}$.

To prove the theorem we have to show that this number equals 1. This follows at once on comparing (1.6) with the binomial expansion of $(1 - 1)^n$ [cf. (6.11) of Chapter 2]. The latter starts with 1, and the terms of (1.6) follow with reversed sign. Hence for every $n \ge 1$ the expression (1.6) equals 1, and this proves the theorem.
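
Formula (1.5) can be tried out on a small finite sample space. The following Python sketch is an illustration of ours, not part of the text; the sample space (the integers 1 to 12 with equal probabilities) and the three events (multiples of 2, 3, and 4) are arbitrary choices, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def S(r, events, prob):
    """S_r of (1.3): sum of the probabilities of all r-fold intersections."""
    total = Fraction(0)
    for group in combinations(events, r):
        common = set.intersection(*group)
        total += sum(prob[point] for point in common)
    return total

def union_probability(events, prob):
    """P_1 of (1.5), computed by inclusion and exclusion."""
    N = len(events)
    return sum((-1) ** (r - 1) * S(r, events, prob) for r in range(1, N + 1))

# Assumed toy example: 12 equally likely points, three overlapping events.
prob = {x: Fraction(1, 12) for x in range(1, 13)}
events = [{x for x in prob if x % d == 0} for d in (2, 3, 4)]

direct = sum(prob[x] for x in set.union(*events))
print(union_probability(events, prob), direct)
```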


2. Examples 

(a) In a game of bridge let $A_i$ be the event “player number i has a complete suit.” Then $p_i = 4 / \binom{52}{13}$; the event that both player i and player j have complete suits can occur in $4 \cdot 3$ ways and has probability $p_{ij} = 12 / \left[ \binom{52}{13} \binom{39}{13} \right]$; similarly $p_{ijk} = 24 / \left[ \binom{52}{13} \binom{39}{13} \binom{26}{13} \right]$. Finally, $p_{1,2,3,4} = p_{1,2,3}$, since whenever three players have a complete suit so does the fourth. The probability that some player has a complete suit is therefore by (1.5)

(2.1) $P_1 = \dfrac{16 \binom{39}{13} \binom{26}{13} - 72 \binom{26}{13} + 72}{\binom{52}{13} \binom{39}{13} \binom{26}{13}}$.

Using Stirling's formula, we see that $P_1 = \tfrac{1}{4} \cdot 10^{-10}$ approximately. In this particular case $P_1$ is very nearly the sum of the probabilities of the $A_i$, but this is the exception rather than the rule.
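
With exact integer arithmetic the value of (2.1) can be obtained without Stirling's formula. A brief Python sketch of ours follows; the variable names are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb

c52, c39, c26 = comb(52, 13), comb(39, 13), comb(26, 13)

p_single = Fraction(4, c52)                  # one named player
p_pair = Fraction(12, c52 * c39)             # two named players
p_triple = Fraction(24, c52 * c39 * c26)     # three named players
p_all = p_triple                             # four players: same event

# S_r = (number of r-subsets of the 4 players) times the common probability
S1, S2, S3, S4 = 4 * p_single, 6 * p_pair, 4 * p_triple, p_all
P1 = S1 - S2 + S3 - S4

print(float(P1))          # close to (1/4) * 10**-10
print(float(S1))          # the sum of the four single probabilities
```

The two printed numbers agree to many digits, which is the sense in which $P_1$ is "very nearly the sum of the probabilities" here.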

(b) Matches (Coincidences). The following problem with many variants and a surprising solution goes back to Montmort (1708). It has been generalized by Laplace and many other authors.

Two similar decks of N different cards each are put into random order and matched against each other. If a card occupies the same place in both decks, we speak of a match (coincidence or rencontre). Matches may occur at any of the N places and at several places simultaneously. This experiment may be described in more amusing forms. For example, the two decks may be represented by a set of N letters and their envelopes, and a capricious secretary may perform the random matching. Alternatively we may imagine the hats in a checkroom





mixed and distributed at random to the guests. A match occurs if a person gets his own hat. It is instructive to venture guesses as to how the probability of a match depends on N: how does the probability of a match of hats in a diner with 8 guests compare with the corresponding probability at a gathering of 10,000 people? It seems surprising that the probability is practically independent of N and roughly 2/3. (For less frivolous applications cf. problems 10 and 11.)

The probabilities of having exactly 0, 1, 2, 3, ... matches will be calculated in section 4. Here we shall find only the probability $P_1$ of at least 1 match. For simplicity of expression let us renumber the cards 1, 2, ..., N in such a way that one deck appears in its natural order, and assume that each permutation of the second deck has probability 1/N!. Let $A_k$ be the event that a match occurs at the kth place. This means that card number k is at the kth place while the remaining N - 1 cards may be in an arbitrary order. Clearly $p_k = (N-1)!/N! = 1/N$. Similarly, for every combination i, j, we have $p_{ij} = (N-2)!/N! = 1/[N(N-1)]$, etc. The sum $S_r$ contains $\binom{N}{r}$ terms each of which equals $(N-r)!/N!$. Hence $S_r = 1/r!$, and from (1.5) we find the required probability to be

(2.2) $P_1 = 1 - \frac{1}{2!} + \frac{1}{3!} - + \cdots \pm \frac{1}{N!}$.

Note that $1 - P_1$ represents the first N + 1 terms in the expansion

(2.3) $e^{-1} = 1 - 1 + \frac{1}{2!} - \frac{1}{3!} + \frac{1}{4!} - + \cdots$.

Therefore we have with a good approximation

(2.4) $P_1 \approx 1 - e^{-1} = 0.63212\ldots$.

The degree of approximation is shown in the following table of correct values of $P_1$:

N   =  3        4        5        6        7
P_1 =  0.66667  0.62500  0.63333  0.63194  0.63214
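
The values of $P_1$ can be recomputed from (2.2), and for small N they can also be confirmed by listing all N! permutations. The Python sketch below is our own illustration with hypothetical function names.

```python
from fractions import Fraction
from itertools import permutations
from math import exp, factorial

def p_at_least_one_match(N):
    """Formula (2.2): 1 - 1/2! + 1/3! - + ... +- 1/N!."""
    return sum(Fraction((-1) ** (r - 1), factorial(r)) for r in range(1, N + 1))

def p_at_least_one_match_by_enumeration(N):
    """Fraction of the N! permutations having at least one fixed place."""
    perms = list(permutations(range(N)))
    hits = sum(1 for perm in perms if any(perm[i] == i for i in range(N)))
    return Fraction(hits, len(perms))

for N in range(3, 8):
    print(N, f"{float(p_at_least_one_match(N)):.5f}")
print("limit", f"{1 - exp(-1):.5f}")
```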

(c) A Sampling Problem. A pack of cards consists of s identical series, each containing n cards numbered 1, 2, ..., n. A random sample of $r \ge n$ cards is drawn from the pack without replacement. We require the probability $u_r$ that each number is represented in the sample. As a particular case consider a deck of bridge cards. For s = 4, n = 13 we get the probability that a hand of r cards contains all thirteen values (cf. footnote on p. 9). For s = 13, n = 4 we get the probability that all four suits are represented.

To calculate $u_r$ let $A_\nu$ be the event that number $\nu$ does not occur in the sample. Remembering that r cards out of m can be selected in $(m)_r = m(m-1) \cdots (m-r+1)$ ways, we find

(2.5) $p_i = \frac{(ns - s)_r}{(ns)_r}, \quad p_{ij} = \frac{(ns - 2s)_r}{(ns)_r}, \ldots$

Since the number of events is N = n, we have $S_1 = n p_i$, $S_2 = \binom{n}{2} p_{ij}$, etc. Substituting into (1.5), we find for the probability that some number does not occur

(2.6) $1 - u_r = \sum_{k=1}^{n} (-1)^{k-1} \binom{n}{k} \frac{(ns - ks)_r}{(ns)_r}$

[cf. problems 12-14 and Chapter 9, example (3.d)].

If in (2.6) we let the number of series $s \to \infty$, then clearly

(2.7) $u_r \to \sum_{k=0}^{n} (-1)^k \binom{n}{k} \left( 1 - \frac{k}{n} \right)^r$.

It is intuitively clear that in the limit our sampling becomes sampling with replacement from the population of the numbers 1, 2, ..., n. The right side of (2.7) is then the probability that in a random sample of size r each element appears at least once. This is also the probability of having no empty boxes if r balls are put into n boxes [cf. equation (5.4) with m = 0]. Finally, (2.7) answers the question of the collector of coupons as to how many coupons he will have to acquire before having a complete series of n coupons. (Cf. problems 12-14.)
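
Formulas (2.6) and (2.7) are easy to evaluate exactly. The following Python sketch is ours (the function names are hypothetical); it treats the bridge-hand case s = 13, n = 4 and the coupon collector's limit.

```python
from fractions import Fraction
from math import comb, perm

def all_numbers_present(s, n, r):
    """u_r from (2.6): sample of r cards, without replacement, from s
    identical series of n numbered cards contains every number."""
    denominator = perm(n * s, r)                      # (ns)_r
    some_missing = sum(
        (-1) ** (k - 1) * comb(n, k) * Fraction(perm(n * s - k * s, r), denominator)
        for k in range(1, n + 1)
    )
    return 1 - some_missing

def all_numbers_present_limit(n, r):
    """Right side of (2.7): the same probability for sampling with replacement."""
    return sum((-1) ** k * comb(n, k) * Fraction(n - k, n) ** r for k in range(n + 1))

# Probability that a bridge hand of 13 cards contains all four suits
print(float(all_numbers_present(13, 4, 13)))
# Coupon collector: chance that 20 coupons complete a series of 5 kinds
print(float(all_numbers_present_limit(5, 20)))
```

Note that `math.perm(m, r)` returns 0 when r exceeds m, which matches the convention $(m)_r = 0$ needed here.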


3. The Realization of m among N Events 

Theorem. For any integer m with $1 \le m \le N$ the probability $P_{[m]}$ that exactly m among the N events $A_1, \ldots, A_N$ occur simultaneously is given by

(3.1) $P_{[m]} = S_m - \binom{m+1}{m} S_{m+1} + \binom{m+2}{m} S_{m+2} - + \cdots \pm \binom{N}{m} S_N$.

Note: According to (1.5), the probability $P_{[0]}$ that none among the $A_j$ occurs is

(3.2) $P_{[0]} = 1 - P_1 = 1 - S_1 + S_2 - S_3 \pm \cdots \mp S_N$.





This shows that (3.1) gives the correct value also for m = 0 provided that we put $S_0 = 1$.

Proof. We proceed as in the proof of (1.5). Let E be an arbitrary sample point, and suppose that it is contained in exactly n among the N events $A_j$. Then $\Pr\{E\}$ appears as a contribution to $P_{[m]}$ only if n = m. To investigate how $\Pr\{E\}$ contributes to the right side of (3.1), note that $\Pr\{E\}$ appears in the sums $S_1, S_2, \ldots, S_n$ but not in $S_{n+1}, \ldots, S_N$. It follows that $\Pr\{E\}$ does not contribute to the right side in (3.1) if n < m. If n = m, then $\Pr\{E\}$ appears in one and only one term of $S_m$. To complete the proof of the theorem it remains to show that for n > m the contributions of $\Pr\{E\}$ to the terms $S_m, S_{m+1}, \ldots, S_n$ on the right in (3.1) cancel. Now out of the n events containing E we can form $\binom{n}{k}$ k-tuplets; hence $\Pr\{E\}$ appears in $S_k$ with the factor $\binom{n}{k}$. For n > m the total contribution of $\Pr\{E\}$ to the right side in (3.1) is therefore

(3.3) $\binom{n}{m} - \binom{m+1}{m} \binom{n}{m+1} + \binom{m+2}{m} \binom{n}{m+2} - + \cdots \pm \binom{n}{m} \binom{n}{n}$.

Now $\binom{m+\nu}{m} \binom{n}{m+\nu} = \binom{n}{m} \binom{n-m}{\nu}$, and hence (3.3) reduces to

$\binom{n}{m} \left\{ \binom{n-m}{0} - \binom{n-m}{1} + - \cdots \pm \binom{n-m}{n-m} \right\}$.

Within braces we have the binomial expansion of $(1 - 1)^{n-m}$ so that (3.3) vanishes, as asserted.
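
Formula (3.1) can be compared with a direct count on a small sample space. The sketch below is an illustration of ours in Python; the sample space and the three events are arbitrary assumptions, and the function names are hypothetical. It uses the convention $S_0 = 1$ so that m = 0 is covered as well.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def S(k, events, prob):
    """S_k: sum of the probabilities of all k-fold intersections; S_0 = 1."""
    if k == 0:
        return Fraction(1)
    total = Fraction(0)
    for group in combinations(events, k):
        total += sum(prob[point] for point in set.intersection(*group))
    return total

def p_exactly(m, events, prob):
    """P_[m] from (3.1)."""
    N = len(events)
    return sum((-1) ** (k - m) * comb(k, m) * S(k, events, prob)
               for k in range(m, N + 1))

def p_exactly_direct(m, events, prob):
    """Add the probabilities of the points lying in exactly m events."""
    return sum(p for point, p in prob.items()
               if sum(point in A for A in events) == m)

# Assumed toy example: integers 1..12, events "multiple of 2, 3, 4".
prob = {x: Fraction(1, 12) for x in range(1, 13)}
events = [{x for x in prob if x % d == 0} for d in (2, 3, 4)]
print([p_exactly(m, events, prob) for m in range(4)])
```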


Example. Quadruples in a Bridge Hand. By a quadruple we shall understand 4 cards of the same face value, so that a bridge hand of 13 cards may contain 0, 1, 2, or 3 quadruples. A hand can be selected in $\binom{52}{13}$ ways, and we attribute probability $p = \binom{52}{13}^{-1}$ to each way. We have here N = 13 possible quadruples, or events. Four aces and 9 other cards can be chosen in $\binom{48}{9}$ ways. For reasons of symmetry we get similarly for all i, j, k

$p_i = \binom{48}{9} p, \quad p_{ij} = \binom{44}{5} p, \quad p_{ijk} = 40 p$

and hence

$S_1 = 13 \binom{48}{9} p, \quad S_2 = \binom{13}{2} \binom{44}{5} p, \quad S_3 = \binom{13}{3} 40 p$,

while $S_4 = S_5 = \cdots = S_{13} = 0$, since it is impossible that more than 3 quadruples occur. The probabilities of 0, 1, 2, or 3 quadruples are therefore

$P_{[0]} = 1 - \left\{ 13 \binom{48}{9} - \binom{13}{2} \binom{44}{5} + \binom{13}{3} 40 \right\} p = 0.96579\ldots$

$P_{[1]} = \left\{ 13 \binom{48}{9} - 2 \binom{13}{2} \binom{44}{5} + 3 \binom{13}{3} 40 \right\} p = 0.0340669\ldots$

$P_{[2]} = \left\{ \binom{13}{2} \binom{44}{5} - 3 \binom{13}{3} 40 \right\} p = 0.0001333\ldots$

$P_{[3]} = \binom{13}{3} 40 p \approx 2 \cdot 10^{-8}$.
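
The four probabilities of this example can be recomputed exactly from (3.1). The Python sketch below is ours; the dictionary `S` and the function `P` are names chosen for illustration.

```python
from fractions import Fraction
from math import comb

p = Fraction(1, comb(52, 13))            # probability of each single hand

# S_k for k = 0..3; S_k = 0 for k >= 4, so those terms are simply omitted.
S = {
    0: Fraction(1),
    1: 13 * comb(48, 9) * p,
    2: comb(13, 2) * comb(44, 5) * p,
    3: comb(13, 3) * 40 * p,
}

def P(m):
    """Probability of exactly m quadruples, by (3.1) with N = 13."""
    return sum((-1) ** (k - m) * comb(k, m) * S[k] for k in range(m, 4))

for m in range(4):
    print(m, float(P(m)))
```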


4. Application to Matching and Guessing 

In example (2.b) we considered the matching of two decks of cards and found that $S_k = 1/k!$. Substituting into (3.1), we find the following result.

In a random matching of two similar decks of N distinct cards the probability $P_{[m]}$ of having exactly m matches is given by

(4.1)

$P_{[0]} = 1 - 1 + \frac{1}{2!} - \frac{1}{3!} + - \cdots \pm \frac{1}{(N-3)!} \mp \frac{1}{(N-2)!} \pm \frac{1}{(N-1)!} \mp \frac{1}{N!}$

$P_{[1]} = 1 - 1 + \frac{1}{2!} - \frac{1}{3!} + - \cdots \pm \frac{1}{(N-3)!} \mp \frac{1}{(N-2)!} \pm \frac{1}{(N-1)!}$

$P_{[2]} = \frac{1}{2!} \left\{ 1 - 1 + \frac{1}{2!} - \frac{1}{3!} + - \cdots \pm \frac{1}{(N-3)!} \mp \frac{1}{(N-2)!} \right\}$

$P_{[3]} = \frac{1}{3!} \left\{ 1 - 1 + \frac{1}{2!} - \frac{1}{3!} + - \cdots \pm \frac{1}{(N-3)!} \right\}$

...

$P_{[N-2]} = \frac{1}{(N-2)!} \left\{ 1 - 1 + \frac{1}{2!} \right\}$

$P_{[N-1]} = \frac{1}{(N-1)!} \{ 1 - 1 \} = 0$

$P_{[N]} = \frac{1}{N!}$.

The last relation is obvious. The vanishing of $P_{[N-1]}$ expresses the impossibility of having N - 1 matches without having all N cards in the same order.

The braces to the right in (4.1) contain the initial terms of the expansion of $e^{-1}$. For large N we have therefore approximately

(4.2) $P_{[m]} \approx \frac{e^{-1}}{m!}$.


In Table 1 the columns headed $P_{[m]}$ give the exact values of $P_{[m]}$ for N = 3, 4, 5, 6, 10. The last column gives the limiting values

(4.3) $p_m = \frac{e^{-1}}{m!}$.

It will be noticed that the approximation of $p_m$ to $P_{[m]}$ is rather good even for moderate values of N.

For the numbers $p_m$ defined by (4.3) we have $\sum p_k = e^{-1} \left( 1 + 1 + \frac{1}{2!} + \frac{1}{3!} + \cdots \right) = e^{-1} e = 1$. Accordingly, the $p_k$ may be interpreted as probabilities. As a matter of fact, we shall later see that the important Poisson distribution leads, for a particular value (unity) of a parameter, to an assignment of probabilities in accordance with (4.3).

Formulas (4.1) are useful in testing guessing abilities. In wine








TABLE 1

Probabilities of m Correct Guesses in Calling a Deck of N Distinct Cards

            N = 3          N = 4          N = 5          N = 6            N = 10
  m     P[m]   b_m     P[m]   b_m     P[m]   b_m     P[m]   b_m      P[m]     b_m        p_m
  0    0.333  0.296   0.375  0.316   0.367  0.328   0.368  0.335   0.36788  0.34868   0.367879
  1     .500   .444    .333   .422    .375   .410    .367   .402    .36788   .38742    .367879
  2     ...    .222    .250   .211    .167   .205    .187   .201    .18394   .19371    .183940
  3     .167   .037    ...    .047    .083   .051    .056   .053    .06131   .05733    .061313
  4                    .042   .004    ...    .006    .021   .008    .01534   .01116    .015328
  5                                   .008   .000    ...    .001    .00306   .00149    .003066
  6                                                  .001   .000    .00052   .00014    .000511
  7                                                                 .00007   .00001    .000073
  8                                                                 .00001   ...       .000009
  9                                                                                    .000001
 10                                                                                    .000000

The $P_{[m]}$ are given by (4.1), the $b_m$ by (4.4). The last column gives the Poisson limits (4.3).


tasting, psychic experiments, etc., the subject is asked to call an unknown order of N things, say, cards. Any actual insight on the part of the subject will appear as a departure from randomness. To judge the amount of insight we must appraise the probability of turns of good luck. Now chance guesses can be made according to several systems among which we mention three extreme possibilities. (1) The subject sticks to one card and keeps calling it. With this system he is sure to have one, and only one, correct guess in each series; chance fluctuations are eliminated. (2) The subject calls each card once so that each series of N guesses corresponds to a rearrangement of the deck. If this system is applied without insight, formulas (4.1) should apply. (3) A third possibility is that N guesses are made absolutely independently of each other. There are $N^N$ possible arrangements. It is true that every person has fixed mental habits and is prone to call certain patterns more frequently than others, but in first approximation we may assume all $N^N$ arrangements to be equally probable. Since m correct and N - m incorrect guesses can be arranged in $\binom{N}{m} (N-1)^{N-m}$ different ways, the probability of exactly m correct guesses is now

(4.4) $b_m = \binom{N}{m} \frac{(N-1)^{N-m}}{N^N}$.

(This is a special case of the binomial distribution; cf. Chapter 6.)

Table 1 gives a comparison of the probabilities of success when 
guesses are made in accordance with system (2) or (3). To judge the 
merits of the two methods we require the theory of mean values and 
probable fluctuations. It turns out that the average number of correct 
chance guesses is one under all systems; the chance fluctuations are 
somewhat larger under system (2) than (3). A glance at Table 1 will 
show that in practice the differences will not be excessive. 
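
The comparison in Table 1 can be reproduced numerically. The following
Python sketch is illustrative only: the function names are ours, and it
assumes that (4.1) has the form
$P_{[m]} = \frac{1}{m!}\sum_{k=0}^{N-m}(-1)^k/k!$, which is not restated
in this section.

```python
from fractions import Fraction
from math import comb, factorial


def matching_prob(m, N):
    """P[m] under system (2); assumes (4.1) is (1/m!) * sum_{k<=N-m} (-1)^k / k!."""
    tail = sum(Fraction((-1) ** k, factorial(k)) for k in range(N - m + 1))
    return tail / factorial(m)


def independent_prob(m, N):
    """b_m under system (3), formula (4.4)."""
    return Fraction(comb(N, m) * (N - 1) ** (N - m), N ** N)


def moments(probs):
    """Mean and variance of the number of correct guesses."""
    mean = sum(m * p for m, p in enumerate(probs))
    second = sum(m * m * p for m, p in enumerate(probs))
    return mean, second - mean ** 2


if __name__ == "__main__":
    N = 6
    system2 = [matching_prob(m, N) for m in range(N + 1)]
    system3 = [independent_prob(m, N) for m in range(N + 1)]
    for m in range(N + 1):
        print(m, round(float(system2[m]), 3), round(float(system3[m]), 3))
    print("system (2):", moments(system2))
    print("system (3):", moments(system3))
```

For N = 6 the mean is 1 under both systems, while the variance is 1 under
system (2) and 5/6 under system (3), in line with the remark that the
fluctuations are somewhat larger under (2).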


* 5. Application to the Classical Occupancy Problem

We now return to the problem of Chapter 3, section 3, and consider a
random distribution of r things in n cells, assuming that each
arrangement has probability $n^{-r}$. We seek the probability $P_{[m]}$ of
finding exactly m cells empty.

Let $A_k$ be the event that cell number k is empty (k = 1, 2, ..., n).
In this event all r balls are placed in the remaining n − 1 cells, and
this can be done in $(n-1)^r$ different ways. Similarly, there are
$(n-2)^r$ arrangements leaving two preassigned cells empty, etc.
Accordingly

(5.1)    $$p_i = \left(1-\frac{1}{n}\right)^r, \quad p_{ij} = \left(1-\frac{2}{n}\right)^r, \quad p_{ijk} = \left(1-\frac{3}{n}\right)^r, \ldots$$

and hence for every $\nu \le n$

(5.2)    $$S_\nu = \binom{n}{\nu}\left(1-\frac{\nu}{n}\right)^r.$$

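Substituting into (3.1), we find

(5.3)    $$P_{[m]} = \sum_{\nu=0}^{n-m}(-1)^\nu\binom{m+\nu}{m}\binom{n}{m+\nu}\left(1-\frac{m+\nu}{n}\right)^r.$$

Expressing the binomial coefficients in terms of factorials shows that
$\binom{m+\nu}{m}\binom{n}{m+\nu} = \binom{n}{m}\binom{n-m}{\nu}$, which
reduces (5.3) to

(5.4)    $$P_{[m]} = \binom{n}{m}\sum_{\nu=0}^{n-m}(-1)^\nu\binom{n-m}{\nu}\left(1-\frac{m+\nu}{n}\right)^r.$$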

* Starred sections treat special topics and may be omitted at first reading. 



Such is the probability of finding exactly m cells empty. [For m = 0 we
find the right side of (2.7).]

We have already used the model of r random digits to illustrate the 
random distribution of r things in n = 10 cells. Empty cells correspond 
in this case to missing digits: if m cells are empty, 10 — m different 
digits appear in the given sequence. Table 2 provides a numerical 
illustration. 


TABLE 2

Probabilities P[m] according to (5.4) for n = 10

  m |   r = 10  |   r = 18
----+-----------+-----------
  0 | 0.000 363 | 0.134 673
  1 | 0.016 330 | 0.385 289
  2 | 0.136 080 | 0.342 987
  3 | 0.355 622 | 0.119 425
  4 | 0.345 144 | 0.016 736
  5 | 0.128 596 | 0.000 876
  6 | 0.017 189 | 0.000 014
  7 | 0.000 672 | 0.000 000
  8 | 0.000 005 | 0.000 000
  9 | 0.000 000 | 0.000 000

P[m] is the probability that exactly m of the digits 0, 1, ..., 9 will not
appear in a sequence of r random digits.
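
Formula (5.4) is easy to evaluate exactly for small n and r. The sketch
below is a minimal illustration (the function name is ours); it uses
exact rational arithmetic so that the alternating sum loses no precision.

```python
from fractions import Fraction
from math import comb


def empty_cells_prob(m, n, r):
    """P[m] from (5.4): exactly m of n cells empty when r things are placed at random."""
    total = sum(
        (-1) ** v * comb(n - m, v) * Fraction(n - m - v, n) ** r
        for v in range(n - m + 1)
    )
    return comb(n, m) * total


if __name__ == "__main__":
    n = 10
    for r in (10, 18):
        column = [empty_cells_prob(m, n, r) for m in range(n)]
        print("r =", r, [f"{float(p):.6f}" for p in column])
```

Running it for n = 10 with r = 10 and r = 18 gives the two columns of
probabilities that Table 2 tabulates.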

It is clear that a direct numerical evaluation of (5.4) is limited to 
the case of relatively small n and r. On the other hand, the occupancy 
problem is of particular interest when n is large. If 10,000 balls are 
distributed in 1000 cells, is there any chance of finding an empty cell? 
In a group of 2000 people, is there any chance of finding a day in the 
year which is not a birthday? Fortunately, questions of this kind 
can be answered by means of a remarkably simple approximation to 
$P_{[m]}$ with an error which tends to zero as n → ∞. This approximation
and the argument leading to it are typical of many limit theorems in
probability.

Our purpose, then, is to discuss the limiting form of the formula
for $P_{[m]}$ as n → ∞ and r → ∞. The relation between r and n is, in
principle, arbitrary. However, the ratio r/n represents the average
number of things per cell. If it is excessively large, then we cannot
expect any empty cells; in this case $P_{[0]}$ is near unity and all
$P_{[m]}$ with m ≥ 1 are small. On the other hand, if r/n tends to zero,
then practically all cells must be empty, and in this case $P_{[m]} \to 0$
for every fixed m. Therefore only the intermediate case is of real interest.



Formula (5.4) for $P_{[m]}$ was derived from the expression (5.2) for
$S_\nu$, and we shall derive an approximation to $P_{[m]}$ from an
approximation to $S_\nu$ for ν fixed. Since $1 - x < e^{-x}$ for all
positive x (and $\binom{n}{\nu} \le n^\nu/\nu!$), we get directly

(5.5)    $$S_\nu < \frac{n^\nu}{\nu!}\,e^{-\nu r/n}.$$

Now put for brevity

(5.6)    $$n e^{-r/n} = \lambda.$$

Then (5.5) takes on the form

(5.7)    $$S_\nu < \frac{\lambda^\nu}{\nu!}.$$

We now show that the right side is not only an upper bound for $S_\nu$,
but also an approximation to it. More precisely, we show that

(5.8)    $$\frac{1}{\nu!}\lambda^\nu - S_\nu \to 0$$

for every fixed ν. If r varies as a function of n so that λ → 0, then
(5.8) is implied by (5.7). Now λ is small unless $r/n^2$ is small, and it
suffices therefore to consider the case where

(5.9)    $$\frac{r}{n^2} \to 0.$$

The Taylor expansion for $e^{-x-x^2}$ shows that for sufficiently small
positive x we have $1 - x > e^{-x-x^2}$. Therefore, at least for
sufficiently large n,

(5.10)    $$S_\nu > \frac{(n-\nu)^\nu}{\nu!}\,e^{-\nu r/n - \nu^2 r/n^2} = \frac{\lambda^\nu}{\nu!}\left\{\left(1-\frac{\nu}{n}\right)^{\nu} e^{-\nu^2 r/n^2}\right\}.$$

In view of (5.9) the term within braces tends to unity, so that the right
side is asymptotically equivalent to $\lambda^\nu/\nu!$. The two
inequalities (5.7) and (5.10) therefore imply (5.8).

We now introduce the approximation (5.8) into the formula (3.1)
for $P_{[m]}$. For the νth term on the right side we have found (at least
for large n)

(5.11)    $$\binom{m+\nu}{m}S_{m+\nu} - \frac{\lambda^m}{m!}\,\frac{\lambda^\nu}{\nu!} \to 0.$$

This means that the expansion (3.1) for $P_{[m]}$ approaches termwise the
series

(5.12)    $$\frac{\lambda^m}{m!}\left\{1 - \frac{\lambda}{1!} + \frac{\lambda^2}{2!} - \frac{\lambda^3}{3!} + - \cdots\right\}.$$

Furthermore, by (5.7) the terms of (3.1) are smaller in absolute value
than those of (5.12). If λ is restricted to a finite interval, then the
series within braces in (5.12) converges absolutely and uniformly to
$e^{-\lambda}$. It follows that in the limit (5.12) represents $P_{[m]}$,
and we have thus proved the


Theorem.² If n and r increase so that $\lambda = n e^{-r/n}$ remains
bounded, then for every fixed m

(5.13)    $$P_{[m]} - e^{-\lambda}\frac{\lambda^m}{m!} \to 0.$$

The approximating expressions

(5.14)    $$p(m; \lambda) = e^{-\lambda}\frac{\lambda^m}{m!}$$

define the so-called Poisson distribution, which is of great importance
and describes a variety of phenomena. For the particular value
λ = 1 we get once more the distribution (4.3).

In practice one may use p(m; λ) as an approximation to $P_{[m]}$ whenever
n is great. For moderate values of n an estimate of the error is
required, but we shall not enter into it.

Examples. (a) Table 3 gives the approximate probabilities of finding
m cells empty when the number of cells is 1000 and the number of
things varies from 5000 to 9000. For r = 5000 the median value of the
number of empty cells is six: seven or more empty cells are about as
probable as six or fewer. Even with 9000 things in 1000 cells we have
about one chance in nine to find an empty cell.

(b) In birthday statistics n = 365, and r is the number of people.
For r = 1900 we find λ = 2, approximately. In a village of 1900 people
the probabilities of finding m days of the year which are not birthdays are
approximately as follows:

$P_{[0]} = 0.135$, $P_{[1]} = 0.271$, $P_{[2]} = 0.271$, $P_{[3]} = 0.180$,
$P_{[4]} = 0.090$, $P_{[5]} = 0.036$, $P_{[6]} = 0.012$, $P_{[7]} = 0.003$.
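
The approximations used in these examples follow from (5.6) and (5.14)
alone. The Python sketch below is a hedged illustration; the helper names
`occupancy_lambda`, `poisson`, and `poisson_cdf` are ours and do not occur
in the text.

```python
from math import exp, factorial


def occupancy_lambda(n, r):
    """lambda = n * exp(-r/n), as defined in (5.6)."""
    return n * exp(-r / n)


def poisson(m, lam):
    """p(m; lambda) from (5.14)."""
    return exp(-lam) * lam ** m / factorial(m)


def poisson_cdf(m, lam):
    """Approximate probability of at most m empty cells."""
    return sum(poisson(k, lam) for k in range(m + 1))


if __name__ == "__main__":
    # Example (b): n = 365 days, r = 1900 people.
    lam = occupancy_lambda(365, 1900)
    print("lambda =", round(lam, 3))
    print([round(poisson(m, 2.0), 3) for m in range(8)])

    # Example (a): n = 1000 cells, r from 5000 to 9000 things.
    for r in range(5000, 10000, 1000):
        lam = occupancy_lambda(1000, r)
        print(r, round(lam, 4), round(1 - poisson(0, lam), 4))
```

The last loop prints, for each r, the value of λ and the approximate
probability of at least one empty cell.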

² Due (with a different proof) to R. von Mises, Über Aufteilungs- und
Besetzungswahrscheinlichkeiten, Revue de la Faculté des Sciences de
l'Université d'Istanbul, N.S., vol. 4 (1939), pp. 145-163.

TABLE 3


The probability of finding exactly m cells each containing exactly k
things can be derived in the same way. As von Mises has shown, this
probability can again be approximated by the Poisson expression
(5.14), only this time λ must be defined by

(5.15)    $$\lambda = n\,\frac{e^{-r/n}}{k!}\left(\frac{r}{n}\right)^k.$$



* 6. Miscellany

(1) The Realization of at Least m Events. With the notations of
section 3 the probability $P_m$ that m or more of the events
$A_1, \ldots, A_N$ occur simultaneously is given by

(6.1)    $$P_m = P_{[m]} + P_{[m+1]} + \cdots + P_{[N]}.$$

To find a formula for $P_m$ in terms of the $S_k$ it is simplest to proceed
by induction, starting with formula (1.5) and using the recurrence relation
$P_{m+1} = P_m - P_{[m]}$. One gets for m ≥ 1

(6.2)    $$P_m = S_m - \binom{m}{m-1}S_{m+1} + \binom{m+1}{m-1}S_{m+2} - + \cdots \pm \binom{N-1}{m-1}S_N.$$

It is also possible to derive (6.2) directly, using the argument which
led to (3.1).

(2) Further Identities. The coefficients $S_\nu$ can be expressed in terms
of either the $P_{[k]}$ or the $P_k$ as follows:

(6.3)    $$S_\nu = \sum_{k=\nu}^{N}\binom{k}{\nu}P_{[k]}$$

and

(6.4)    $$S_\nu = \sum_{k=\nu}^{N}\binom{k-1}{\nu-1}P_k.$$

Indication of Proof. For given values of $P_{[m]}$ the equations (3.1)
may be taken as linear equations in the unknowns $S_\nu$, and we have to
prove that (6.3) represents the unique solution. If (6.3) is introduced
into the expression (3.1) for $P_{[m]}$, the coefficient of $P_{[k]}$
(m ≤ k ≤ N) to the right is found to be

(6.5)    $$\sum_{\nu=m}^{k}(-1)^{\nu-m}\binom{\nu}{m}\binom{k}{\nu} = \binom{k}{m}\sum_{\nu=m}^{k}(-1)^{\nu-m}\binom{k-m}{\nu-m}.$$

If k = m this expression reduces to 1. If k > m the sum is the binomial
expansion of $(1-1)^{k-m}$ and therefore vanishes. Hence the substitution
(6.3) reduces (3.1) to the identity $P_{[m]} = P_{[m]}$. The uniqueness
of the solution of (3.1) follows from the fact that each equation
introduces only one new unknown, so that the $S_\nu$ can be computed
recursively. The truth of (6.4) can be proved in a similar way.

(3) Bonferroni's Inequalities. A string of inequalities both for $P_{[m]}$
and for $P_m$ can be obtained in the following way. If in either (3.1) or
(6.2) only the terms involving $S_m, S_{m+1}, \ldots, S_{m+r-1}$ are
retained while the terms involving $S_{m+r}, S_{m+r+1}, \ldots, S_N$ are
dropped, then the error (i.e., true value minus approximation) has the sign
of the first omitted term [namely, $(-1)^r$] and is smaller in absolute
value. Thus, for r = 1 and r = 2:

(6.6)    $$S_m - (m+1)S_{m+1} \le P_{[m]} \le S_m$$

and

(6.7)    $$S_m - mS_{m+1} \le P_m \le S_m.$$

Indication of Proof. To prove the statement for (3.1) it must be
shown that

(6.8)    $$\sum_{\nu=t}^{N}(-1)^{\nu-t}\binom{\nu}{m}S_\nu \ge 0$$

for every t. Now use (6.3) to write the left side as a linear combination
of the $P_{[k]}$. For t ≤ k ≤ N the coefficient of $P_{[k]}$ equals

$$\sum_{\nu=t}^{k}(-1)^{\nu-t}\binom{\nu}{m}\binom{k}{\nu} = \binom{k}{m}\sum_{\nu=t}^{k}(-1)^{\nu-t}\binom{k-m}{\nu-m}.$$

The last sum equals $\binom{k-m-1}{t-m-1}$ and is therefore positive
(Chapter 2, section 9, problem 13).

For further inequalities the reader is referred to Fréchet's monograph
cited at the beginning of the chapter.
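
The inequalities (6.6) and (6.7) can be checked numerically on the
occupancy problem, where the $S_\nu$ are known from (5.2). The sketch below
is illustrative only; the function names are ours, and it assumes that
(3.1) reads $P_{[m]} = \sum_{\nu \ge m}(-1)^{\nu-m}\binom{\nu}{m}S_\nu$,
a formula that is not restated in this section.

```python
from fractions import Fraction
from math import comb


def S(v, n, r):
    """S_v for the occupancy problem, formula (5.2); zero when v > n."""
    if v > n:
        return Fraction(0)
    return comb(n, v) * Fraction(n - v, n) ** r


def exactly(m, n, r):
    """P[m], using the assumed form of (3.1)."""
    return sum((-1) ** (v - m) * comb(v, m) * S(v, n, r) for v in range(m, n + 1))


def at_least(m, n, r):
    """P_m from (6.1): m or more cells empty."""
    return sum(exactly(k, n, r) for k in range(m, n + 1))


def bounds_exactly(m, n, r):
    """Lower and upper bounds of (6.6)."""
    return S(m, n, r) - (m + 1) * S(m + 1, n, r), S(m, n, r)


def bounds_at_least(m, n, r):
    """Lower and upper bounds of (6.7), stated for m >= 1."""
    return S(m, n, r) - m * S(m + 1, n, r), S(m, n, r)


if __name__ == "__main__":
    n, r = 10, 18
    for m in range(1, 5):
        lo, hi = bounds_exactly(m, n, r)
        print(m, float(lo), float(exactly(m, n, r)), float(hi))
```

The lower bound is informative only when $S_{m+1}$ is small compared with
$S_m$; otherwise it can be negative.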


7. Problems for Solution 

Note: Assume in each case that all possible arrangements have the same probability. 
1. Ten pairs of shoes are in a closet. Four shoes are selected at random. Find 
the probability that there will be at least one pair among the four shoes selected. 




2. Five dice are thrown. Find the probability that at least three of them show 
the same face. (Verify by the methods of Chapter 3.) 

3. Find the probability that in 5 tossings a coin falls heads at least three times 
in succession. 

4. Solve problem 3 for a head-run of at least length 5 in 10 tossings. 

5. Solve problems 3 and 4 for ace runs when a die is used instead of a coin. 

6. Two dice are thrown r times. Find the probability $p_r$ that each of the
six combinations (1, 1), ..., (6, 6) appears at least once.

7. Sampling with replacement. A sample of size r is taken from a population of
n people. Find the probability $u_r$ that N given people will all be included
in the sample. (This question applies to collecting coupons.)

8. Continuation. Show that if n → ∞ and r → ∞ so that r/n → p, then
$u_r \to (1 - e^{-p})^N$.

9. Sampling without replacement. Answer problem 7 for this case and show that,
in the limit of problem 8, $u_r \to p^N$.

10. In the general expansion of a determinant of order N the number of terms
containing one or more diagonal elements is $N!\,P_1$ with $P_1$ defined by (2.2).

11. The number of ways in which 8 rooks can be placed on a chessboard so that
none can take another and that none stands on the white diagonal is
$8!\,(1 - P_1)$, where $P_1$ is defined by (2.2) with N = 8.

12. From (2.6) conclude that

$$\sum_{k=0}^{n}(-1)^k\binom{n}{k}(ns - ks)^r = 0$$

if r < n and

$$\sum_{k=0}^{n}(-1)^k\binom{n}{k}(ns - ks)^n = s^n\,n!.$$
k=0 \K/ 

13. Solve problem 12 by evaluating the rth derivative, at x = 0, of
$(e^{sx} - 1)^n$.

14. In the sampling problem (2.c) find the probability that it will take exactly
r drawings to get a sample containing all numbers. Pass to the limit as s → ∞.

15. Bridge-bingo. From the table given in the answer section to problem 40
of section 8 in Chapter 2, compute the probabilities $q_1(r)$, $q_2(r)$,
$q_3(r)$ that after r drawings exactly 1, 2, 3 of the 4 players are without
cards.

16. A cell contains N chromosomes, between any two of which an interchange
of parts may occur. If r interchanges occur [which can happen in
$\binom{N}{2}^r$ distinct ways], find the probability that exactly m
chromosomes will be involved.³

³ For N = 6 cf. D. G. Catcheside, D. E. Lea, and J. M. Thoday, Types of
Chromosome Structural Change Introduced by the Irradiation of Tradescantia
Microspores, Journal of Genetics, vol. 47 (1945-46), pp. 113-149.

17. Find the probability that exactly k suits will be missing in a poker hand
(for definition cf. footnote on p. 9).

18. Find the probability that a hand of 13 bridge cards contains the ace-king
pairs of exactly k suits.

19. Multiple matching. Two similar decks of N distinct cards each are matched
simultaneously against a similar target deck. Find the probability $u_m$ of
having exactly m double matches. Show that $u_0 \to 1$ as N → ∞ (which implies
that $u_m \to 0$ for m ≥ 1).

20. Multiple matching. The procedure of the preceding problem is modified as
follows. Out of the 2N cards N are chosen at random, and only these N are matched
against the target deck. Find the probability of no match. Prove that it tends to
1/e as N → ∞.

21. Multiple matching. Answer problem 20 if r decks are used instead of two.

22. To section 6. Prove that for m ≥ 1 the probability of m or more cells
remaining empty is

(7.1)    $$\binom{n}{m}\sum_{\nu=0}^{n-m}(-1)^\nu\binom{n-m}{\nu}\left(1-\frac{m+\nu}{n}\right)^r\frac{m}{m+\nu}.$$

23. From (7.1) deduce the identity⁴

(7.2)    $$\sum_{\nu=0}^{n}(-1)^\nu\binom{n}{\nu}\nu^r = 0,$$

valid for n > r. Verify (7.2) by considering the rth derivative of
$(1 - e^x)^n$ at x = 0.

24. In the classical occupancy problem, the probability $P_{[m]}(k)$ of finding
exactly m cells occupied by exactly k things is

$$P_{[m]}(k) = \frac{(-1)^m\,n!\,r!}{m!\,n^r}\sum_j(-1)^j\frac{(n-j)^{r-jk}}{(j-m)!\,(n-j)!\,(r-jk)!\,(k!)^j},$$

the summation extending over those j ≥ m for which j ≤ n and kj ≤ r.

25. Prove the last statement of section 5 for the case k = 1.

26. Using (3.1), derive the probability of finding exactly m empty cells in the case
of Bose-Einstein statistics.

27. Verify that the formula obtained in 26 checks with Chapter 3, formula
(5.3).

⁴ In the notations of the calculus of differences (7.2) can be written
$(-1)^n\Delta^n 0^r = 0$.


CHAPTER 5 


CONDITIONAL PROBABILITY 
STATISTICAL INDEPENDENCE 


1. Conditional Probability 

The following example leads in a natural way to the notion of
conditional probability. Suppose a population of N people includes $N_A$
colorblind people and $N_H$ females. Let the events that a person chosen
at random is colorblind and a female be A and H, respectively. Then
(cf. the definition of random choice, Chapter 2, section 2)

(1.1)    $$\Pr\{A\} = \frac{N_A}{N}, \qquad \Pr\{H\} = \frac{N_H}{N}.$$

Instead of the entire population, we may investigate the female
subpopulation and require the probability that a female chosen at random
is colorblind. This probability is $N_{HA}/N_H$, where $N_{HA}$ is the
number of colorblind females. We have here no new notion, but we need a
new notation to designate which particular subpopulation is under
investigation. The most widely adopted symbol is Pr{A | H}; it may
be read “the probability of the event A (colorblindness), assuming
the event H (that the person chosen is female).” The formula

(1.2)    $$\Pr\{A \mid H\} = \frac{N_{HA}}{N_H} = \frac{\Pr\{AH\}}{\Pr\{H\}}$$

suggests the following general definition whose usefulness and
plausibility will be illustrated by further examples.

Definition. Let H be an event with positive probability. For an
arbitrary event A we shall write

(1.3)    $$\Pr\{A \mid H\} = \frac{\Pr\{AH\}}{\Pr\{H\}}.$$

The quantity so defined will be called the conditional probability of A on
the hypothesis H (or for given H). When all sample points have equal
probabilities, Pr{A | H} is the ratio $N_{HA}/N_H$ of the number of sample
points common to A and H, to the number of points in H.

Conditional probabilities remain undefined when the hypothesis has
zero probability. This is of no consequence in the case of discrete
sample spaces, but is important in the general theory.

Though the symbol Pr{A | H} itself is practical, its phrasing in
words is so unwieldy that in practice less formal descriptions are used.
Thus in our introductory example we referred to the probability of a
female being colorblind instead of saying “the conditional probability
of a person chosen at random being colorblind on the hypothesis that
the person is a female.” Often the phrase “on the hypothesis H” is
replaced by “if it is known that H occurred.” In short, our formulas
and symbols are unequivocal, but phrasings in words are often informal
and must be properly interpreted.

Whenever convenient for stylistic clarity one speaks of absolute
probabilities in contradistinction to conditional ones. Strictly speaking,
the adjective “absolute” is redundant and will be omitted (as has been
done in Chapter 1).

Examples. (a) An urn contains r red and b black balls. Two balls
are chosen at random without replacement. If the first ball is red,
what is the probability that the second ball is red also? The hypothesis
H, “first ball red,” can occur in r(r + b − 1) ways; the event AH,
“both balls red,” can occur in r(r − 1) ways. Therefore, Pr{A | H}
= (r − 1)/(r + b − 1). This formula expresses the fact that the
second choice refers to an urn with r − 1 red and b black balls.

(b) Distribution of Sexes. Consider families with exactly two
children. Letting b and g stand for boy and girl, respectively, and the
first letter for the older child, we have four possibilities: bb, bg, gb, gg.
These are the four sample points, and we associate probability 1/4 with
each. Given that a family has a boy (event H), what is the probability
that both children are boys (event A)? The event AH means bb, and
H means bb, or bg, or gb. Therefore, Pr{A | H} = 1/3: in about
one-third of the families with the characteristic H we can expect that A
also will occur. It is interesting that most people expect the answer
to be 1/2. This is the correct answer to a different question, namely:
a boy is chosen at random and found to come from a family with two
children; what is the probability that the other child is a boy? The
difference may be explained empirically. With our original problem
we might refer to a card file of families, with the second to a file of
males. In the latter, each family with two boys will be represented
twice, and this explains the difference between the two results.
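
The two "card files" of example (b) can be enumerated directly. The
following Python sketch is only an illustration of that bookkeeping; the
names `family_file` and `boy_file` are ours.

```python
from fractions import Fraction
from itertools import product

# The four equally probable sample points bb, bg, gb, gg (older child first).
families = ["".join(p) for p in product("bg", repeat=2)]


def family_file():
    """Pr{both boys | family has a boy}: one record per family."""
    with_boy = [f for f in families if "b" in f]
    return Fraction(sum(f == "bb" for f in with_boy), len(with_boy))


def boy_file():
    """Pr{other child is a boy}: one record per boy, so bb is listed twice."""
    boys = [(f, i) for f in families for i, child in enumerate(f) if child == "b"]
    other_is_boy = sum(f[1 - i] == "b" for f, i in boys)
    return Fraction(other_is_boy, len(boys))


if __name__ == "__main__":
    print(family_file(), boy_file())
```

The file of families has three records with a boy, one of them bb; the
file of boys has four records, two of which come from the family bb.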

(c) Bridge. If North has no ace (hypothesis H), what is the probability
that South has no ace either? Assuming all arrangements equally
probable, we find $\Pr\{H\} = \binom{48}{13}\Big/\binom{52}{13} \approx 0.3038$.
The event AH means “neither North nor South has an ace” and hence

$$\Pr\{AH\} = \frac{\binom{48}{13}\binom{35}{13}}{\binom{52}{13}\binom{39}{13}}.$$

Therefore

(1.4)    $$\Pr\{A \mid H\} = \binom{35}{13}\Big/\binom{39}{13} \approx 0.1818.$$

Again this result could have been anticipated by the following reasoning.
North knows that four aces and 35 non-aces are divided among
the remaining three players. His partner's hand can be selected in
$\binom{39}{13}$ ways, of which $\binom{35}{13}$ lead to the event
“no ace.” (Cf. problems 3-5.)
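
The binomial coefficients of example (c) are easily evaluated. The sketch
below is a plain numerical check of (1.4) from the definition (1.3); the
variable names are ours.

```python
from fractions import Fraction
from math import comb

# Pr{H}: North's 13 cards are all drawn from the 48 non-aces.
pr_H = Fraction(comb(48, 13), comb(52, 13))

# Pr{AH}: North gets 13 of the 48 non-aces, then South 13 of the remaining 35.
pr_AH = Fraction(comb(48, 13) * comb(35, 13), comb(52, 13) * comb(39, 13))

# Definition (1.3).
pr_A_given_H = pr_AH / pr_H

if __name__ == "__main__":
    print(float(pr_H), float(pr_A_given_H))
```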

(d) We conclude with an example in which the sample space contains
infinitely many points. Suppose we shoot at a target until it is hit for
the first time. The theory of the next chapter will lead us to attribute
probability $(1-p)p^{n-1}$ to the event that exactly n trials are required;
here p (the probability of a miss) is a constant with 0 < p < 1 (cf. also
problem 7). In other words, our sample space consists of the points
1, 2, 3, ... with corresponding probabilities $(1-p)$, $(1-p)p$,
$(1-p)p^2$, .... Assuming that the first trial results in failure
(hypothesis H), what is the probability that more than three trials will
be necessary? Here Pr{H} = 1 − (1 − p) = p. More than three
trials are required only when the first results in failure. In our case,
therefore, A = AH and $\Pr\{AH\} = (1-p)p^3\{1 + p + p^2 + \cdots\} = p^3$.
Hence $\Pr\{A \mid H\} = p^2$. Among all cases which do not end
with success at the first trial, those which are continued beyond the
third trial should have an average frequency $p^2$.


Taking conditional probabilities of various events with respect to a
particular hypothesis H amounts to choosing H as a new sample space;
we have to multiply all probabilities by the constant factor 1/Pr{H}
in order to reduce the total probability of the new sample space to
unity. This formulation shows that all general theorems on probabilities
are valid also for conditional probabilities with respect to any particular
hypothesis H. As an example we mention the fundamental relation
for the probability of the occurrence of either A or B or both. We have

(1.5)    $$\Pr\{A \cup B \mid H\} = \Pr\{A \mid H\} + \Pr\{B \mid H\} - \Pr\{AB \mid H\}.$$

Similarly, all theorems of Chapter 4 concerning probabilities of the
realization of m among N events carry over to conditional probabilities,
but we shall not need them.

Formula (1.3) is often used in the form

(1.6)    $$\Pr\{AH\} = \Pr\{A \mid H\}\cdot\Pr\{H\}.$$

This is the so-called theorem on compound probabilities. To
generalize it to three events A, B, C we first take H = BC as hypothesis
and then apply (1.6) once more; it follows that

(1.7)    $$\Pr\{ABC\} = \Pr\{A \mid BC\}\cdot\Pr\{B \mid C\}\cdot\Pr\{C\}.$$

A further generalization to four or more events is straightforward.

We conclude with a simple formula which is frequently useful. Let
$H_1, \ldots, H_n$ be a set of mutually exclusive events of which one
necessarily occurs (that is, the union of $H_1, \ldots, H_n$ is the entire
sample space). Then any event A can occur only in conjunction with some
$H_j$, or in symbols,

(1.8)    $$A = AH_1 \cup AH_2 \cup \cdots \cup AH_n.$$

Since the $AH_j$ are mutually exclusive, their probabilities add.
Applying (1.6) to $H = H_j$ and adding, we get

(1.9)    $$\Pr\{A\} = \sum_j \Pr\{A \mid H_j\}\cdot\Pr\{H_j\}.$$

This formula is useful because an evaluation of the conditional
probabilities $\Pr\{A \mid H_j\}$ is sometimes easier than a direct
calculation of Pr{A}. Examples will be found in the next section.

2. Compound Experiments 

The use of conditional probabilities greatly simplifies formulations.
In applications, many experiments are described in terms of
conditional probabilities (although the adjective “conditional” is usually
omitted). We shall give a few examples which will reveal a general
scheme more effectively than a direct description could.

Examples. (a) Families. We want to interpret the following
statement. “The probability of a family with exactly k children is $p_k$
(where $p_0 + p_1 + \cdots = 1$). For any family size all sex distributions
have equal probabilities.” Letting b stand for boy and g for girl, our
sample space consists of the points 0 (no children), b, g, bb, bg, gb, gg,
bbb, .... The second assumption in quotation marks can be stated
more formally thus: if it is known that the family has exactly n children,
each of the $2^n$ possible sex distributions has conditional probability
$2^{-n}$. The probability of the hypothesis is $p_n$, and we see from (1.6)
that the absolute probability of any arrangement of n letters b and g is
$p_n \cdot 2^{-n}$.

Let A stand for the event “the family has boys but no girls.” Its
probability is obviously
$\Pr\{A\} = p_1 2^{-1} + p_2 2^{-2} + p_3 2^{-3} + \cdots$.
Incidentally, this is a special case of (1.9). The hypothesis $H_j$ in this
case is “family has j children.” We now ask the question: if it is known
that a family has no girls, what is the (conditional) probability that
it has only one child? Here A is the hypothesis. Let H be the event
“only one child.” Then AH means “one child and no girl,” and

(2.1)    $$\Pr\{H \mid A\} = \frac{\Pr\{AH\}}{\Pr\{A\}} = \frac{p_1 2^{-1}}{p_1 2^{-1} + p_2 2^{-2} + p_3 2^{-3} + \cdots}.$$


(b) Mixed Populations. Suppose a human population consists of
subpopulations or strata $H_1, H_2, \ldots$. These may be races, age groups,
professions, etc. Let $p_j$ be the probability that an individual chosen
at random belongs to $H_j$. Saying “the probability that an individual
in $H_j$ is left-handed is $q_j$” is short for “the conditional probability
of the event A (left-handedness) on the hypothesis that an individual belongs
to $H_j$ is $q_j$.” The probability that an individual chosen at random is
left-handed is $p_1q_1 + p_2q_2 + p_3q_3 + \cdots$, which is again a special
case of (1.9). Given that an individual is left-handed, the conditional
probability of his belonging to stratum $H_j$ is

(2.2)    $$\Pr\{H_j \mid A\} = \frac{p_jq_j}{p_1q_1 + p_2q_2 + \cdots}.$$
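
Formulas (1.9) and (2.2) translate into a few lines of code. The sketch
below is illustrative: the function names are ours, and the three strata
with their probabilities $p_j$ and $q_j$ are hypothetical numbers chosen
only to make the arithmetic visible.

```python
from fractions import Fraction


def total_probability(p, q):
    """Pr{A} = sum_j q_j * p_j, a special case of (1.9)."""
    return sum(pj * qj for pj, qj in zip(p, q))


def posterior(p, q):
    """Pr{H_j | A} for every stratum j, formula (2.2)."""
    pr_A = total_probability(p, q)
    return [pj * qj / pr_A for pj, qj in zip(p, q)]


# Hypothetical strata: sizes p_j and left-handedness rates q_j.
p = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
q = [Fraction(1, 10), Fraction(1, 20), Fraction(1, 5)]

if __name__ == "__main__":
    print(total_probability(p, q))
    print(posterior(p, q))
```

With these made-up numbers the smallest stratum contributes 8/21 of all
left-handed individuals although it holds only one-fifth of the population.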

(c) Polya's Urn Scheme. An urn contains b black and r red balls.
Random drawings are made. The ball drawn is always replaced, and,
in addition, c balls of the color drawn are added to the urn. Here we
are given conditional probabilities only. If the first ball is black, the
(conditional) probability of a black ball at the second drawing is
(b + c)/(b + c + r). The absolute probability of the sequence black,
black is therefore, by (1.6),

(2.3)    $$\frac{b}{b+r}\cdot\frac{b+c}{b+c+r}.$$

If the first two drawings result in black, then the urn contains b + r
+ 2c balls among which b + 2c are black. The (conditional) probability
of a black ball at the third trial becomes, again using (1.6),

(2.4)    $$\frac{b+2c}{b+2c+r}.$$

In this way we can calculate all probabilities. It is easily seen that any
sequence of n drawings resulting in $n_1$ black and $n_2$ red balls
($n_1 + n_2 = n$) has the same probability as the event of extracting
first $n_1$ black and then $n_2$ red balls, namely,

(2.5)    $$p_{n_1,n} = \frac{b(b+c)(b+2c)\cdots(b+n_1c-c)\cdot r(r+c)\cdots(r+n_2c-c)}{(b+r)(b+r+c)(b+r+2c)\cdots(b+r+nc-c)}.$$

This scheme¹ was devised for the analysis of phenomena like
contagious diseases, where the occurrence of certain events increases their
future probabilities. More general probability models of this kind
will be taken up later on. [Polya's scheme is discussed in problems
15-19 and again in Chapter 6, problems 33-35; Chapter 9, problem
13; and Chapter 10, problem 9; cf. also Chapter 15, example (10.a),
and Chapter 17, problems 5 and 6.]
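
The claim that the probability of a sequence depends only on the numbers
of black and red balls, not on their order, can be checked by multiplying
the conditional probabilities step by step. The sketch below is a minimal
illustration; the function names and the sample values b = 2, r = 3, c = 1
are ours.

```python
from fractions import Fraction
from itertools import permutations


def sequence_probability(seq, b, r, c):
    """Probability of a colour sequence such as 'BBR', by repeated use of (1.6)."""
    prob = Fraction(1)
    black, red = b, r
    for colour in seq:
        if colour == "B":
            prob *= Fraction(black, black + red)
            black += c  # the ball is replaced and c black balls are added
        else:
            prob *= Fraction(red, black + red)
            red += c
    return prob


def closed_form(n1, n2, b, r, c):
    """Right side of (2.5) for n1 black and n2 red balls."""
    numerator = 1
    for i in range(n1):
        numerator *= b + i * c
    for i in range(n2):
        numerator *= r + i * c
    denominator = 1
    for i in range(n1 + n2):
        denominator *= b + r + i * c
    return Fraction(numerator, denominator)


if __name__ == "__main__":
    for seq in sorted(set(permutations("BBR"))):
        print("".join(seq), sequence_probability(seq, 2, 3, 1))
    print(closed_form(2, 1, 2, 3, 1))
```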

(d) Die A has four red and two white faces, whereas die B has two
red and four white faces. The game starts by flipping a coin once: if
it falls heads, the game continues by throwing die A alone; if it falls
tails, die B is to be used. Here again we know conditional probabilities
only. For example, the conditional probability of the sequence (red,
red, white), assuming heads at the first trial, is $(4\cdot4\cdot2)\cdot6^{-3}$.
For 3 throws of the die we have 16 sample points. Each of the points
(H, R, R, R) and (T, W, W, W) has probability 4/27; the 6 points
(H, R, R, W), (H, R, W, R), (H, W, R, R), (T, W, W, R), (T, W, R, W),
(T, R, W, W) have probability 2/27 each; the 6 points obtained by
interchanging W and R have probability 1/27 each; finally, (H, W, W, W)
and (T, R, R, R) have probability 1/54 each.

Note that the probability of red at any trial is 1/2. If it is not
known which die is used and at the first 2 throws red is observed, then
the conditional probability of red at the third trial is

(2.6)    $$\frac{\Pr\{(H,R,R,R)\} + \Pr\{(T,R,R,R)\}}{\Pr\{(H,R,R)\} + \Pr\{(T,R,R)\}} = \frac{3}{5}$$

(cf. problem 14).
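
The sixteen sample points of example (d) and the ratio (2.6) can be
verified by direct enumeration. The following Python sketch is only a
check of the arithmetic; the names are ours.

```python
from fractions import Fraction
from itertools import product

# Conditional probability of red: die A after heads, die B after tails.
RED = {"H": Fraction(4, 6), "T": Fraction(2, 6)}


def point_probability(coin, throws):
    """Probability of the sample point (coin, throws), e.g. ('H', 'RRW')."""
    prob = Fraction(1, 2)
    for face in throws:
        prob *= RED[coin] if face == "R" else 1 - RED[coin]
    return prob


def pr_sequence(throws):
    """Absolute probability of a colour sequence when the die is unknown."""
    return sum(point_probability(coin, throws) for coin in "HT")


def sample_points(n_throws=3):
    """All points (coin, colour sequence) for the given number of throws."""
    return [(coin, "".join(t)) for coin in "HT" for t in product("RW", repeat=n_throws)]


if __name__ == "__main__":
    for coin, throws in sample_points():
        print(coin, throws, point_probability(coin, throws))
    print(pr_sequence("RRR") / pr_sequence("RR"))
```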

¹ F. Eggenberger and G. Polya, Über die Statistik verketteter Vorgänge,
Zeitschrift für Angewandte Mathematik und Mechanik, vol. 3 (1923), pp. 279-289.

(e) The following example is famous and illustrative, but somewhat
artificial. Imagine a population of N + 1 urns, each containing N red
and white balls; the urn number k contains k red and N − k white balls
(k = 0, 1, 2, ..., N). An urn is chosen at random and n random
drawings are made from it, the ball drawn being replaced each time.
Suppose that all n balls turn out to be red (event A). We seek the
(conditional) probability that the next drawing will yield a red ball
also (event B). If the first choice falls on urn number k, then the
probability of extracting in succession n red balls is $(k/N)^n$. Hence,
by (1.9),

(2.7)    $$\Pr\{A\} = \frac{1^n + 2^n + \cdots + N^n}{N^n(N+1)}.$$

The event AB means that n + 1 drawings yield red balls, and therefore

(2.8)    $$\Pr\{AB\} = \Pr\{B\} = \frac{1^{n+1} + 2^{n+1} + \cdots + N^{n+1}}{N^{n+1}(N+1)}.$$

The required probability is Pr{B | A} = Pr{B}/Pr{A}.

The sums in (2.7) and (2.8) can be considered Riemann sums
approximating integrals, so that when N is large

(2.9)    $$\frac{1}{N}\sum_{k=1}^{N}\left(\frac{k}{N}\right)^n \approx \int_0^1 x^n\,dx = \frac{1}{n+1}.$$

We have therefore for large N approximately

(2.10)    $$\Pr\{B \mid A\} \approx \frac{n+1}{n+2}.$$

This formula can be interpreted roughly as follows: if all compositions
of an urn are equally probable, and if n trials yielded red balls, the
probability of a red ball at the next trial is (n + 1)/(n + 2). This is
the so-called law of succession of Laplace (1812).
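
How quickly the exact ratio of (2.8) to (2.7) approaches (n + 1)/(n + 2)
can be seen numerically. The sketch below is illustrative only; the
function names are ours, and the event B is treated, as in (2.8), as
"n + 1 red balls in succession."

```python
from fractions import Fraction


def pr_all_red(n, N):
    """Probability (2.7) that n drawings from a randomly chosen urn are all red."""
    return Fraction(sum(k ** n for k in range(N + 1)), N ** n * (N + 1))


def pr_next_red(n, N):
    """Exact Pr{B | A}: the ratio of (2.8) to (2.7)."""
    return pr_all_red(n + 1, N) / pr_all_red(n, N)


if __name__ == "__main__":
    n = 5
    for N in (10, 100, 1000):
        print(N, float(pr_next_red(n, N)), (n + 1) / (n + 2))
```

For fixed n the exact value lies slightly above (n + 1)/(n + 2) in these
runs and approaches it as N grows.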

Before the ascendance of the modern theory, the notion of equal 
probabilities was often used as synonymous for “no advance knowledge.”
Laplace himself has illustrated the use of (2.10) by computing
the probability that the sun will rise tomorrow, given that it has risen 
daily for 5000 years or n = 1,826,213 days. It is said that Laplace was 
ready to bet 1,826,214 to 1 in favor of regular habits of the sun, and 
we should be in a position to better the odds since regular service has 
followed for another century. A historical study would be necessary 
to render justice to Laplace and to understand his intentions. His 
successors, however, used similar arguments in routine work and 
recommended methods of this kind to physicists and engineers in cases 
where the formulas have no operational meaning. We would have to 
reject the method even if, for sake of argument, we were to concede 
that our universe was chosen at random from a collection in which all 
conceivable possibilities were equally likely. In fact, the assumed
rising of the sun on February 5, 3123 B.C., is by no means more certain
than that the sun will rise tomorrow. We believe in both for the same
reasons.


Note on Bayes’s Rule. In (2.1) and (2.2) we have calculated certain conditional 
probabilities directly from the definition. The beginner is advised always to do 
so and not to memorize the formula (2.12) which we shall now derive. It retraces 
in a general way what we did in special cases, but it is only a way of rewriting (1.3). 
We had a collection of events Hi, i/ 2 , ... which are mutually exclusive and exhaus¬ 
tive, that is, every sample point belongs to one, and only one, among the Hj, 
We were interested in 


( 2 . 11 ) 


Pr\H k \A\ 


PrjAHh] 
Pr\A\ ' 


If (1.6) and (1.9) are introduced into (2.11), it takes the form 


( 2 . 12 ) 


Pr\H k \A) 


Pj\A | fffc} Prjlh] 

E Pr U I Hj) Pr { Hj ( ' 


If the events H_k are called causes, then (2.12) becomes “Bayes's rule for the
probability of causes.” Mathematically, (2.12) is a special way of writing (1.3) and
nothing more. The formula is useful in many statistical applications of the type
described in the above examples (a-d). Unfortunately, Bayes's rule has been somewhat
discredited by metaphysical applications of the type described in example (e).
In routine practice this kind of argument can be dangerous. A quality-control
engineer is concerned with one particular machine and not with an infinite population
of machines from which one was chosen at random. He has been advised to
use Bayes’s rule on the grounds that it is logically acceptable and corresponds to
our way of thinking. Plato used this type of argument to prove the existence of
Atlantis, and philosophers used it to prove the absurdity of Newton’s mechanics.
In our case it overlooks the circumstance that the engineer desires success and that
he will do better by estimating and minimizing the sources of various types of
errors in prediction and guessing. The modern method of statistical tests and
estimation is less intuitive but more realistic. It may be not only defended but
also applied.
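
As a purely arithmetical illustration of (2.12), the following sketch computes the conditional probabilities Pr{H_k | A} from given values of Pr{H_k} and Pr{A | H_k}. The function name and the numbers are hypothetical and are not taken from the text; they serve only to show that (2.12) is the quotient (2.11) written out.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Return Pr{H_k | A} for every k, following (2.12)."""
    # Numerators Pr{A | H_k} Pr{H_k} = Pr{A H_k}
    joint = [p * l for p, l in zip(priors, likelihoods)]
    # Denominator: the sum over j, which equals Pr{A}
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical values: three mutually exclusive, exhaustive events H_1, H_2, H_3
priors = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
likelihoods = [Fraction(1, 100), Fraction(2, 100), Fraction(5, 100)]
post = posterior(priors, likelihoods)
```

With these assumed numbers the three conditional probabilities add to unity, as they must for an exhaustive system of events.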


3. Statistical Independence 

In the above examples the conditional probability Pr{A | H} generally
does not equal the absolute probability Pr{A}. Popularly
speaking, the information as to whether H has occurred changes our
way of betting on the event A. Only when Pr{A | H} = Pr{A}, this
information does not permit any inference as to the occurrence of A.
In this case we shall say that A is statistically independent of H.
Now (1.6) shows that the condition Pr{A | H} = Pr{A} can be written
in the form

(3.1)    Pr{AH} = Pr{A}·Pr{H}.





This equation is symmetric in A and H, and shows that whenever A is
statistically independent of H so is H of A. It is therefore preferable
to start from the following symmetric

Definition 1. Two events A and H are said to be statistically independent
(or independent, for short) if equation (3.1) holds. This definition
is accepted also if Pr{H} = 0, in which case Pr{A | H} is not defined.

Examples. (a) A card is chosen at random from a deck of playing
cards. For reasons of symmetry we expect the events “spade” and
“ace” to be independent. As a matter of fact, their probabilities are
1/4 and 1/13, and the probability of their simultaneous realization is
1/52 = 1/4 · 1/13.

(b) Two true dice are thrown. We verify that the events “ace with
first die” and “even face with second” are independent; the probability
of their simultaneous realization, 3/36 = 1/12, is the product of their
probabilities, namely 1/6 and 1/2.

(c) In a random permutation of the four letters (a, b, c, d) the events
“a precedes b” and “c precedes d” are independent. This is intuitively
clear and easily verified.

(d) Sex Distribution. We return to example (1.6) but now consider
families with three children. We assume that each of the eight possibilities
bbb, bbg, ..., ggg has probability 1/8. Let H be the event
“the family has children of both sexes,” and A the event “there is at
most one girl.” Then Pr{H} = 6/8 and Pr{A} = 4/8. The simultaneous
realization of A and H means one of the possibilities bbg, bgb,
gbb, and therefore Pr{AH} = 3/8 = Pr{A}·Pr{H}. Thus in families
with three children the two events are independent. Note that this
is not the case for families with two or four children. This shows that
it is not always obvious whether or not we have independence.
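
The claim of example (d) can be checked by direct enumeration. The sketch below assumes, as in the example, that all 2^n sex sequences of a family with n children are equally likely; the function name is introduced here for illustration only.

```python
from fractions import Fraction
from itertools import product

def independent_for(n):
    """Check Pr{AH} = Pr{A} Pr{H} for families with n children."""
    families = list(product("bg", repeat=n))  # equally likely sample points

    def pr(event):
        return Fraction(sum(1 for f in families if event(f)), len(families))

    both_sexes = lambda f: "b" in f and "g" in f      # event H
    at_most_one_girl = lambda f: f.count("g") <= 1    # event A
    joint = pr(lambda f: both_sexes(f) and at_most_one_girl(f))
    return joint == pr(both_sexes) * pr(at_most_one_girl)

results = {n: independent_for(n) for n in (2, 3, 4)}
```

Under this assumption the multiplication rule holds for n = 3 and fails for n = 2 and n = 4, in agreement with the text.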

If H occurs, then the complementary event H' does not occur, and
vice versa. Statistical independence implies that no inference can be
drawn from the occurrence of H to that of A; therefore statistical
independence of A and H should mean the same as independence of A
and H' (and, because of symmetry, also of A' and H, and of A' and H').
This assertion is easily verified, using the relation Pr{H'} = 1 - Pr{H}.
If (3.1) holds, then (since AH' = A - AH)

(3.2)    Pr{AH'} = Pr{A} - Pr{AH} = Pr{A} - Pr{A}·Pr{H}
                 = Pr{A}·Pr{H'},

as expected.






Suppose now that three events A, B, and C are pairwise independent 
so that 

(3.3)    Pr{AB} = Pr{A}·Pr{B},
         Pr{AC} = Pr{A}·Pr{C},
         Pr{BC} = Pr{B}·Pr{C}.

One might think that this always implies the independence of such 
pairs of events as AB and C. Unfortunately this is not necessarily so. 
We shall exhibit an example in which (3.3) is true but the simultaneous 
occurrence of A, B, and C is impossible, so that AB and C cannot be 
independent. 

Example. (e) Two dice are thrown and three events are defined as
follows. A means “odd face with first die”; B means “odd face with
second die”; finally, C means “odd sum” (one face even, the other odd).
If each of the 36 sample points has probability 1/36, then any two of
the events are clearly independent. The probability of each is 1/2,
and so is its conditional probability, assuming that one of the other
two events has occurred. Nevertheless, the three events cannot occur
simultaneously. The information that A but not B has occurred
assures that C has occurred, and a similar statement holds for all other
combinations.
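
Example (e) can be verified by enumerating the 36 sample points. The following sketch, with helper names chosen here for illustration, checks the three equations (3.3) and computes the probability of the simultaneous occurrence of A, B, and C.

```python
from fractions import Fraction
from itertools import product

points = list(product(range(1, 7), repeat=2))  # 36 equally likely points

def pr(event):
    return Fraction(sum(1 for pt in points if event(pt)), len(points))

A = lambda pt: pt[0] % 2 == 1             # odd face with first die
B = lambda pt: pt[1] % 2 == 1             # odd face with second die
C = lambda pt: (pt[0] + pt[1]) % 2 == 1   # odd sum

def both(e, f):
    return lambda pt: e(pt) and f(pt)

# The three pairwise multiplication rules (3.3)
pairwise = all(pr(both(x, y)) == pr(x) * pr(y)
               for x, y in [(A, B), (A, C), (B, C)])
# Probability that all three events occur together
triple = pr(lambda pt: A(pt) and B(pt) and C(pt))
```

Here `pairwise` is true while `triple` is zero, whereas the product Pr{A} Pr{B} Pr{C} equals 1/8.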

It is desirable to reserve the term statistical independence for the
case where no such inference is possible. For this it is necessary that
(3.3) holds, but we must in addition assume that

(3.4)    Pr{ABC} = Pr{A} Pr{B} Pr{C}.

This equation insures that A and BC are independent and also that
the same is true of B and AC, and C and AB. Furthermore, it can
now be proved also that A ∪ B and C are independent. In fact, by
the fundamental relation (6.4) of Chapter 1 we have

(3.5)    Pr{(A ∪ B)C} = Pr{AC} + Pr{BC} - Pr{ABC}.

Now, applying (3.3) and (3.4) to the right side, we can factor out Pr{C}.
The other factor is Pr{A} + Pr{B} - Pr{AB} = Pr{A ∪ B} so that

(3.6)    Pr{(A ∪ B)C} = Pr{A ∪ B} Pr{C}.

This makes it plausible that the conditions (3.3) and (3.4) together
suffice to avoid embarrassment; any event expressible in terms of A
and B will be independent of C.




In the general case of n events the following definition proves satisfactory.

Definition 2. The events A_1, A_2, ..., A_n are called mutually independent
if for all combinations 1 ≤ i < j < k < ... ≤ n the multiplication
rules

(3.7)    Pr{A_i A_j} = Pr{A_i} Pr{A_j}
         Pr{A_i A_j A_k} = Pr{A_i} Pr{A_j} Pr{A_k}
         . . . . . . . . . . . . . . . . . . . . .
         Pr{A_1 A_2 ... A_n} = Pr{A_1} Pr{A_2} ... Pr{A_n}

apply.


The first line stands for C(n, 2) equations, the second for C(n, 3), etc.,
where C(n, k) denotes the binomial coefficient. We have, therefore,
C(n, 2) + C(n, 3) + ... + C(n, n) = (1 + 1)^n - C(n, 1) - C(n, 0)
= 2^n - n - 1 conditions which must be satisfied. On the other hand,
the C(n, 2) conditions stated in the first line suffice to insure pairwise
independence. The whole system (3.7) looks like a complicated set of
conditions, but it will soon become apparent that its validity is usually
obvious and requires no checking. The distinction between mutual
and pairwise independence is of theoretical rather than practical
interest. Practical examples of pairwise independent events which are
not mutually independent apparently do not exist. The possibility
of such an occurrence was discovered by S. Bernstein.
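
The count of conditions in (3.7) can be confirmed by listing the index combinations themselves. This is only a small check, with a function name chosen here for illustration.

```python
from itertools import combinations

def count_conditions(n):
    """Count the multiplication rules in (3.7) for n events."""
    indices = range(1, n + 1)
    # One rule for every choice of two or more indices i < j < k < ...
    return sum(1 for size in range(2, n + 1)
               for _ in combinations(indices, size))

counts = {n: count_conditions(n) for n in range(2, 8)}
```

For each n in this range the count agrees with 2^n - n - 1; for three events it gives the four equations (3.3) and (3.4).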


4. Repeated Trials 

The notion of statistical independence finally enables us to formulate 
analytically the intuitive concept of experiments “repeated under 
identical conditions.” 

Consider the sample space 𝔖 representing a certain conceptual
experiment. Let the sample points be E_1, E_2, ... and denote their
probabilities by p_1, p_2, .... The possible results of a succession of two
similar experiments are the pairs (E_j, E_k) and they form a new sample
space. In it probabilities can be assigned in many ways. However, if
the experimentalist says that two measurements are performed under
identical conditions, he implies independence: the first outcome should
have no influence on the second. This means that the two events
“first outcome is E_j” and “second outcome is E_k” should be statistically
independent or that

(4.1)    Pr{(E_j, E_k)} = p_j p_k.

This equation assigns a probability to every pair (E_j, E_k) of possible
outcomes. Before we can use (4.1) as a definition of probabilities in
the new sample space, we must show that the quantities p_j p_k add to
unity. Now, in the sum ΣΣ p_j p_k each term appears once, and only
once, so that ΣΣ p_j p_k = (p_1 + p_2 + ...)(p_1 + p_2 + ...) = 1. Hence
(4.1) is acceptable as a definition of probabilities.

Let A and B be two arbitrary events in the original sample space 𝔖.
We denote the event “A occurred at first trial and B at second” by
(A, B). Suppose A contains the points E_{a_1}, E_{a_2}, ... and B the points
E_{b_1}, E_{b_2}, .... Then (A, B) is the union of all pairs (E_{a_j}, E_{b_k}), and as
before we see that

(4.2)    Pr{(A, B)} = ΣΣ p_{a_j} p_{b_k} = (Σ p_{a_j})(Σ p_{b_k})
                    = Pr{A} Pr{B}.

Hence the events A and B are independent. We see that the definition
(4.1) entails that all events at the second trial are independent of
events at the first trial. For the purposes of probability theory this
describes “identical experiments.”

These considerations obviously also apply to a succession of r experiments
and lead to the

Definition. Let 𝔖 be a sample space with sample points E_1, E_2, ...
and corresponding probabilities p_1, p_2, .... By r independent trials
corresponding to 𝔖 we mean the sample space whose points are the r-tuples
(E_{j_1}, E_{j_2}, ..., E_{j_r}) to which the probabilities

(4.3)    Pr{(E_{j_1}, E_{j_2}, ..., E_{j_r})} = p_{j_1} p_{j_2} ... p_{j_r}

are assigned.

In other words, each point of the new space is a sample of size r
(with possible repetitions) of points of the original space, and probabilities
are defined by the multiplication rule (4.3). The reader is
reminded that (4.3) is not the only possible definition of probabilities.
In other words, repeated trials are not necessarily independent. For
example, in sampling without replacement we are concerned with
dependent trials. Equation (4.3) defines independent trials or, in physical
terms, trials repeated under identical conditions.
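
The construction of the new sample space by (4.3) can be sketched as follows for a finite space. The single-trial probabilities and the events A and B below are hypothetical; the sketch merely confirms that the products add to unity and that (4.2) holds.

```python
from fractions import Fraction
from itertools import product
from math import prod

def independent_trials(probs, r):
    """Assign to each r-tuple of sample points the product (4.3)."""
    return {pt: prod(probs[e] for e in pt)
            for pt in product(list(probs), repeat=r)}

# Hypothetical single-trial space with unequal probabilities
single = {"E1": Fraction(1, 2), "E2": Fraction(1, 3), "E3": Fraction(1, 6)}
space = independent_trials(single, 2)
total = sum(space.values())  # should be unity

# Event (A, B): A at the first trial, B at the second
A, B = {"E1", "E2"}, {"E3"}
pr_AB = sum(p for (first, second), p in space.items()
            if first in A and second in B)
pr_A = sum(single[e] for e in A)
pr_B = sum(single[e] for e in B)
```

The space need not have equal probabilities for this to work, which is the advantage of the definition noted in the text.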

The argument which led to (4.2) shows more generally the truth of 
the following 

Theorem. Suppose that a system of events A_1, A_2, ..., A_r is such that
the jth trial alone decides whether or not A_j occurs; then the events A_1,
..., A_r are mutually independent if the trials are independent, that is,
if (4.3) holds.

If 𝔖 contains a finite number, N, of points, then there are N^r sample
points (E_{j_1}, ..., E_{j_r}). If each point of 𝔖 has probability 1/N, then
(4.3) assigns probability N^{-r} to each point (E_{j_1}, ..., E_{j_r}). The new
approach is conceptually preferable to a formal assignment of equal
probabilities because it applies to sample spaces with unequal probabilities
and also to infinite sample spaces. It is indispensable for the
general theory of probability where we consider even a single trial as
the first in a potentially infinite sequence. We are then dealing only
with infinite sequences (E_{j_1}, E_{j_2}, ...) of possible outcomes, and in this
new space probabilities are defined in a way consistent with (4.3).
Unfortunately this leads beyond the theory of discrete sample spaces,
to which the present volume is restricted. We have a more elementary
theory but pay for it by the necessity of changing the sample space
according to the number of trials.

In the preceding discussion we have considered only repetitions of 
the same experiment, but successions of unlike experiments can be 
treated in the same way. If we first toss a coin, then throw a die, we 
naturally assume that the two experiments are independent. This 
amounts to assigning probabilities by the product rule. Thus
Pr{(heads, ace)} = 1/2 · 1/6, etc. In this particular case this is
equivalent to assigning equal probabilities to all twelve sample points,
but in general we must proceed as in (4.3). Let 𝔖' and 𝔖'' be two
sample spaces and denote their points by E_1', E_2', ... and E_1'',
E_2'', .... Let the corresponding probabilities be p_1', p_2', ... and
p_1'', p_2'', .... The succession of the two experiments is described by
the space with points (E_j', E_k''). Saying that the two experiments are
independent means defining probabilities by

(4.4)    Pr{(E_j', E_k'')} = p_j' p_k''.

Examples. (a) Permutations. We have considered the n! permutations
of a_1, a_2, ..., a_n as points of a sample space and attributed probability
1/n! to each. We may consider the same sample space as
representing n - 1 successive experiments as follows. Begin by writing
down a_1. The first experiment consists in putting a_2 either before or
after a_1. This done, we have three places for a_3 and the second experiment
consists of a choice among them. This decides on the relative
order of a_1, a_2, and a_3. Now we have four places to choose from for a_4.
In general, when a_1, ..., a_k are put into some relative order, we proceed
with experiment number k, which consists in selecting one of the k + 1
places for a_{k+1}. As an example, take n = 5. The permutation
(a_4, a_2, a_1, a_5, a_3) is built up successively by choosing for a_2, a_3, a_4,
and a_5 the first, last, first, and fourth of the available places. In other
words, we have a succession of n - 1 experiments of which the kth can
result in k + 1 different choices (sample points), each having probability
1/(k + 1). The experiments are independent, that is, the probabilities are
multiplicative. Each permutation of the n elements has probability
1/2 · 1/3 ··· 1/n, in accordance with the original definition.
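
The successive construction of a permutation may be sketched as follows. Elements are represented by their indices, places are numbered from 1, and the function name is introduced here for illustration only.

```python
from collections import Counter
from itertools import product

def build(choices):
    """Build a permutation from the places chosen in experiments 1, 2, ...

    choices[k-1] is the place (1 to k+1) selected for element k+1
    in experiment number k.
    """
    perm = [1]  # begin by writing down a_1
    for k, place in enumerate(choices, start=1):
        perm.insert(place - 1, k + 1)
    return tuple(perm)

# The example of the text: first, last, first, and fourth place
example = build((1, 3, 1, 4))

# All sequences of choices for n = 5: the kth experiment has k + 1 results
all_choices = product(range(1, 3), range(1, 4), range(1, 5), range(1, 6))
counts = Counter(build(c) for c in all_choices)
```

Each of the 120 permutations of five elements arises from exactly one sequence of choices, so that the product rule assigns probability 1/2 · 1/3 · 1/4 · 1/5 = 1/120 to each.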

(b) Sampling without Replacement. Let the population be (a_1,
..., a_n). In sampling without replacement each choice removes an
element. After k steps there remain n - k elements, and the next
choice can be described by specifying the number ν of the place of the
element chosen (ν = 1, 2, ..., n - k). In this way the taking of a
sample of size r without replacement becomes a succession of r experiments
where the first has n possible results, the second n - 1, the third
n - 2, etc. We attribute equal probabilities to all results of the individual
experiments and postulate that the r experiments are independent.
This amounts to attributing probability 1/(n)_r to each sample in
accordance with our definition of random samples. [Note that for
n = 100, r = 3, the sample (a_13, a_40, a_81) means choices number 13, 39,
79, respectively. We must say that at the third experiment the seventy-ninth
element of the reduced population of n - 2 was chosen, for with
the original numbering the outcomes of the third experiment would
depend on the first two choices.] We see that the notion of repeated
independent experiments permits us to study sampling as a succession
of individual operations.
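
The renumbering described in the bracketed note can be sketched as follows. Elements are represented by their original indices, and the function name is hypothetical.

```python
from fractions import Fraction
from math import perm

def choice_numbers(sample, n):
    """Translate a sample without replacement into successive choice numbers."""
    remaining = list(range(1, n + 1))  # indices of the elements still present
    numbers = []
    for element in sample:
        # Place of the chosen element within the reduced population
        numbers.append(remaining.index(element) + 1)
        remaining.remove(element)
    return numbers

numbers = choice_numbers((13, 40, 81), 100)
# Probability 1/(n)_r of each sample, with (n)_r = n(n-1)...(n-r+1)
probability = Fraction(1, perm(100, 3))
```

For the sample of the text this gives the choices 13, 39, 79, and each sample of size 3 from 100 elements receives probability 1/(100 · 99 · 98).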

4a. A Guide to Abstract Language 

The notions with which probability theory deals occur also in other branches
of mathematics. Sample space is simply an abstract space in which a probability
measure is defined. The term repeated trials refers to combinatorial product spaces
with congruent component spaces; a measure is defined on the product space and
induces measures on the component spaces. Independence of trials means product
measure. Saying that event A depends only on trial number k is an abbreviation
for “A is a cylindrical set with base in the kth component space.” The phrase
“if it is known that A occurred” is a translation of “if x ∈ A,” where x stands for a
point in sample space. Successive experiments refer to product spaces with different
component spaces, and “independent” again refers to product measure.


* 5. Applications to Genetics 

The theory of heredity, originated by G. Mendel (1822-1884), provides
instructive illustrations for the applicability of simple probability
models. We shall restrict ourselves to indications concerning the most 
elementary problems. In describing the biological background, we 
shall necessarily oversimplify and concentrate on such facts as are 
pertinent to the mathematical treatment. 

Heritable characters depend on special carriers, called genes. All 
cells of the body, except the reproductive cells or gametes, carry exact 
replicas of the same gene structure. The salient fact is that genes 
appear in pairs. The reader may picture the genes as a vast collection 
of beads on short pieces of string, the chromosomes. These appear in 
pairs so that the two genes of a pair occupy similar positions on two 
related chromosomes. In the simplest case each gene of a particular 
pair can assume two forms (alleles), A and a. Then three different pairs 
can be formed, and, with respect to this particular pair, the organism 
belongs to one of the three genotypes AA, Aa, aa (there is no distinction 
between Aa and aA). For example, peas carry a pair of genes such 
that A causes red, and a causes white, blossom color. The three genotypes
are in this case distinguishable as red, pink, and white. Each
pair of genes determines one heritable factor, but the majority of
observable properties of organisms depend on several factors. For some
characteristics (e.g., eye color, and left-handedness) the influence of
one particular pair of genes is predominant, and in such cases the effects
of Mendelian laws are readily observable. Other characteristics, such
as height, can be understood as the cumulative effect of a very large
number of genes [cf. Chapter 10, example (5.c)]. Here we shall study
genotypes and inheritance for only one particular pair of genes with
respect to which we have the three genotypes AA, Aa, aa. Frequently
there are N different forms A_1, ..., A_N for the two genes and, accordingly,
N(N + 1)/2 genotypes A_1A_1, A_1A_2, ..., A_NA_N. The theory
applies to this case with obvious modifications (cf. problem 21). The
following calculations apply also to the case where A is dominant and
a recessive. By this is meant that Aa-individuals have the same observable
properties as AA, so that only the pure aa-type shows an
observable influence of the a-gene. All shades of partial dominance
appear in nature. Typical partially recessive properties are blue eyes,
left-handedness, etc.

* Starred sections treat special topics and may be omitted at first reading.

The reproductive cells, or gametes, are formed by a splitting process 
and receive one gene only. Organisms of the pure AA- and aa-genotypes
(or homozygotes) produce therefore gametes of only one kind, but
Aa-organisms (hybrids or heterozygotes) produce A- and a-gametes in 
equal numbers. New organisms are derived from two parental gametes 
from which they receive their genes. Therefore each pair includes a 
paternal and a maternal gene, and any gene can be traced back to one 
particular ancestor in any generation, however remote. 

The genotypes of offspring depend on a chance process. At every 
occasion, each parental gene has probability 1/2 to be transmitted, 
and the successive trials are independent. In other words, we conceive 
of the genotypes of n offspring as the result of n independent trials, 
each of which corresponds to the tossing of two coins. For example, 
the genotypes of descendants of an Aa X Aa pairing are AA, Aa, aa 
with respective probabilities 1/4, 1/2, 1/4. An AA X aa union can 
have only Aa-offspring, etc. 
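
The transmission rule just described can be sketched as the tossing of two coins, one for each parent. Genotypes are written as two-letter strings, and the function name is introduced here for illustration.

```python
from collections import Counter
from fractions import Fraction

def offspring_distribution(father, mother):
    """Genotype probabilities for one offspring of the given pairing."""
    dist = Counter()
    for paternal_gene in father:        # each gene transmitted with probability 1/2
        for maternal_gene in mother:
            # Aa and aA are not distinguished; 'A' sorts before 'a'
            genotype = "".join(sorted(paternal_gene + maternal_gene))
            dist[genotype] += Fraction(1, 4)
    return dict(dist)

hybrid_cross = offspring_distribution("Aa", "Aa")
pure_cross = offspring_distribution("AA", "aa")
```

The first pairing yields AA, Aa, aa with probabilities 1/4, 1/2, 1/4, and the second yields only Aa-offspring, as stated in the text.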

Looking at the population as a whole, we conceive of the pairing of
parents as the result of a second chance process. We shall investigate
only the so-called random mating, which is defined by this condition:
if r descendants in the first filial generation are chosen at random, then
their parents form a random sample of size r, with possible repetitions,
from the aggregate of all possible parental pairs. In other words, each
descendant is to be regarded as the product of a random selection of
parents, and all the selections are mutually independent. Random
mating is an idealized model of the conditions in many natural populations
and in many field experiments. However, if red peas are sown
in one corner of the field and white peas in another, parents of like color
will unite more often than under random mating. Preferential selectivity
(such as blonde preferring blondes) violates the condition of
random mating. Extreme non-random mating is represented by self-fertilizing
plants and artificial inbreeding. Such assortative mating
systems have been analyzed mathematically, but we shall restrict our
attention mainly to random mating.

The genotype of an offspring is the result of four independent random 
choices. The genotypes of the two parents can be selected in 3·3 ways,
their genes in 2·2 ways. However, we may combine two selections
and describe the process as one of double selection thus: the paternal 
and maternal gene are each selected independently and at random from 
the population of all genes carried by males or females of the parental 
population. 





Suppose that the three genotypes AA, Aa, aa occur among males
and females in the same ratios, u:2v:w. We shall suppose u + 2v + w
= 1 and call u, 2v, w the genotype frequencies. Put

(5.1)    p = u + v,    q = v + w.

Clearly the numbers of A- and a-genes are as p:q, and since p + q = 1
we shall call p and q the gene frequencies of A and a. In each of the
two selections an A-gene is selected with probability p, and, because of
the assumed independence, the probability of an offspring being AA
is p^2. The genotype Aa can occur in two ways, and its probability is
therefore 2pq. Thus, under random mating conditions an offspring
belongs to the genotypes AA, Aa, or aa with probabilities

(5.2)    u_1 = p^2,    2v_1 = 2pq,    w_1 = q^2.

Examples. (a) All parents are Aa (heterozygotes). Then u = w
= 0, 2v = 1, and p = q = 1/2; (b) AA- and aa-parents are mixed in
equal proportions. Then u = w = 1/2, v = 0, and again p = q = 1/2;
(c) u = w = 1/4, 2v = 1/2. Again p = q = 1/2. In all three cases
we have for the filial generation u_1 = 1/4, 2v_1 = 1/2, w_1 = 1/4.

For a better understanding of the implications of (5.2) let us fix the
gene frequencies p and q (p + q = 1) and consider all systems of
genotype frequencies u, 2v, w for which u + v = p and v + w = q.
They all lead to the same probabilities (5.2) for the first filial generation.
Among them there is the particular distribution

(5.3)    u = p^2,    2v = 2pq,    w = q^2.

If the frequencies u, v, w in the original generation stand in the particular
relation (5.3) (as in example c), then we find for the genotype
probabilities in the first filial generation u_1 = u, v_1 = v, and w_1 = w.
Therefore we call genotype distributions of the form (5.3) stable. To
every ratio p:q there corresponds a stable distribution.
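
Formulas (5.1) and (5.2) can be applied mechanically. The sketch below identifies probabilities with frequencies, thus neglecting chance fluctuations; the function name and the last starting distribution are hypothetical.

```python
from fractions import Fraction

def next_generation(u, two_v, w):
    """Apply (5.1) and (5.2): genotype probabilities in the filial generation."""
    v = two_v / 2
    p, q = u + v, v + w            # gene frequencies (5.1)
    return p * p, 2 * p * q, q * q  # (5.2)

half, quarter = Fraction(1, 2), Fraction(1, 4)
example_a = next_generation(Fraction(0), Fraction(1), Fraction(0))
example_b = next_generation(half, Fraction(0), half)
example_c = next_generation(quarter, half, quarter)

# A hypothetical unstable start: one step produces a stable distribution
first = next_generation(Fraction(1, 2), Fraction(1, 5), Fraction(3, 10))
second = next_generation(*first)
```

All three examples give 1/4, 1/2, 1/4, and for the hypothetical start the second application of the formulas reproduces the result of the first, which is the stability expressed by (5.3).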

Equations (5.2) give the genotype probabilities for a randomly
selected individual of the second generation. In a large population
we must expect the actual genotype frequencies to be very close to
the theoretical distribution. 2 Now, whatever the distribution u:2v:w
in the parental generation, equations (5.2) define a stable distribution;
in it the genes A and a appear with frequencies [cf. (5.1)] u_1 + v_1
= u + v = p and v_1 + w_1 = v + w = q. In other words, if the
observed frequencies coincided exactly with the calculated probabilities,
then the first filial generation would have a stable genotype
distribution which would perpetuate itself without change in all
succeeding generations. In practice, deviations will be observed, but
for large populations we can say: whatever the composition of the parent
population may be, random mating will within one generation produce
approximately a stable genotype distribution with unchanged gene frequencies.
From the second generation on, there is no tendency towards
a systematic change: a steady state is reached with the first filial generation.
This was first noticed by G. H. Hardy, 3 who thus resolved
assumed difficulties in Mendelian laws. It follows in particular that
under conditions of random mating the frequencies of the three genotypes
must stand in the ratios p^2 : 2pq : q^2. This can in turn be
used to check the assumption of random mating.

2 Without this our probability model would be void of operational meaning.
The statement is made precise by the law of large numbers and the central limit
theorem, which permits us to estimate the effect of chance fluctuations.

Hardy also pointed out that emphasis must be put on the word 
“approximately.” Because of chance fluctuations the actual genotype 
frequencies will never coincide exactly with the theoretical probabilities 
(5.2). Therefore, even with a stable distribution we must expect small 
changes from generation to generation. This leads us to the following 
picture. Starting from any parent population, random mating tends 
to establish the stable genotype distribution (5.3) within one generation. 
For a stable distribution there is no tendency towards a systematic 
change of any kind. However, chance fluctuations will change the 
gene frequencies p and q from generation to generation, and the genetic 
composition will slowly drift. There are no restoring forces seeking to 
reestablish original frequencies. On the contrary, our simplified model 
leads to the conclusion [cf. Chapter 15, example 2, X] that, whenever a 
population is bounded in size, one gene should ultimately die out, so 
that the population should eventually belong to one of the pure types, 
AA or aa. In nature this does not occur because of the creation of 
new genes by mutations, selections, and many other effects. These 
more complicated processes of evolution can be studied by more refined 
mathematical tools (Markov chains, diffusion theory). 

Hardy’s theorem is frequently interpreted to imply a strict stability
for all times. It is a common fallacy to believe that the law of large
numbers acts as a force endowed with memory seeking a return to the
original state, and many wrong conclusions have been drawn from this
assumption. (The biological processes here considered are typical of
the important class of Markov processes which will be studied in detail
in Chapter 15.)

3 G. H. Hardy, Mendelian Proportions in a Mixed Population, Letter to the
Editor, Science, N.S., vol. 28 (1908), pp. 49-50. Anticipating the language of
Chapters 9 and 15, we can describe the situation as follows. The frequencies of
the three genotypes in the nth generation are three random variables whose expected
values are given by (5.2) and do not depend on n. Their actual values will
vary from generation to generation and form a stochastic process of the Markov
type.

It should also be noted that Hardy’s law does not apply to the 
distribution of two pairs of genes (e.g., eye color and left-handedness). 
With respect to two pairs we have to distinguish nine genotypes 
AABB, AABb, • • •, aabb. There is still a tendency towards a stable 
distribution, but equilibrium is not reached in the first generation 
(cf. problem 25). 

* 6. Sex-linked Characters 

In the introduction to the preceding section it was mentioned that 
genes lie on chromosomes. These appear in pairs and are transmitted 
as units, so that all genes on a chromosome stick together. 4 Our 
scheme for the inheritance of genes therefore applies also to chromosomes
as units. Sex is determined by two chromosomes; females are
XX, males XY. The mother necessarily transmits an X-chromosome, 
and the sex of offspring depends on the chromosome transmitted by the 
father. Accordingly, male and female gametes are produced in equal 
numbers. The difference in birth rate for boys and girls is explained 
by variations in prenatal survival chances. 

It has been said that both genes and chromosomes appear in pairs.
There is an exception inasmuch as the genes situated on the X-chromosome
have no corresponding gene on Y. Females have two X-chromosomes,
and hence two of such X-linked genes; however, in males
the X-genes appear as singles. Typical are two sex-linked genes
causing colorblindness and haemophilia. With respect to each of
them, females can still be classified into the three genotypes, AA, Aa,
aa, but, having only one gene, males have only the two genotypes A
and a. Note that a son always has the father’s Y-chromosome so that a
sex-linked character cannot be inherited from father to son. However,
it can pass from father to daughter and from her to a grandson.

4 This picture is somewhat complicated by occasional breakings and recombinations
of chromosomes, cf. Chapter 2, section 8, problem 11.

We now proceed to generalize the analysis of the preceding section.
Assume again random mating and let the frequencies of the genotypes
AA, Aa, aa in the female population be u, 2v, w, respectively. As before
put p = u + v, q = v + w. The frequencies of the two male genotypes
A and a will be denoted by p' and q' (p' + q' = 1). Then p and p'
are the frequencies of the A-gene in the female and male populations,
respectively. The probability for a female descendant to be of genotype
AA, Aa, aa will be denoted by u_1, 2v_1, w_1; the analogous probabilities
for the male types A and a are p_1', q_1'. Now a male offspring
receives his X-chromosome from the female parent, and hence

(6.1)    p_1' = p,    q_1' = q.

For the three female genotypes we find, as in section 5,

(6.2)    u_1 = pp',    2v_1 = pq' + qp',    w_1 = qq'.


Hence

(6.3)    p_1 = u_1 + v_1 = (p + p')/2,    q_1 = v_1 + w_1 = (q + q')/2.

We can interpret these formulas as follows. Among the male descendants
the genes A and a appear approximately with the frequencies
p, q of the maternal population; the gene frequencies among female
descendants are approximately p_1 and q_1, or half-way between those
of the paternal and maternal populations. We discern a tendency
towards equalization of the gene frequencies. In fact, from (6.1) and
(6.3) we get

(6.4)    p_1' - p_1 = (p - p')/2,    q_1' - q_1 = (q - q')/2.


This means that random mating will in one generation reduce approximately
by one-half the differences between gene frequencies among
males and females. However, it will not eliminate the differences, and
a tendency towards further reduction will subsist. In contrast to
Hardy’s law, we have here no stable situation after one generation.
We can pursue the systematic component of the changes from generation
to generation by neglecting chance fluctuations and identifying
the theoretical probabilities (6.2) and (6.3) with corresponding actual
frequencies in the first filial generation. 5 For the second generation we
obtain by the same process

(6.5)    p_2 = (p_1 + p_1')/2 = 3p/4 + p'/4,    q_2 = (q_1 + q_1')/2 = 3q/4 + q'/4,

and, of course, p_2' = p_1, q_2' = q_1. A few more trials will lead to the
general expression for the probabilities p_n and q_n among females of
the nth descendant generation. Put

(6.6)    α = (2p + p')/3,    β = (2q + q')/3.

(Note that α + β = 1.) Then

(6.7)    p_n = (p_{n-1} + p_{n-1}')/2 = α + (-1)^n (p - p')/(3·2^n),
         q_n = (q_{n-1} + q_{n-1}')/2 = β + (-1)^n (q - q')/(3·2^n),

and p_n' = p_{n-1}, q_n' = q_{n-1}. Hence

(6.8)    p_n → α,    p_n' → α,    q_n → β,    q_n' → β.

The genotype frequencies in the female population are given by (6.2) or

(6.9)    u_n = p_{n-1} p_{n-1}',    2v_n = p_{n-1} q_{n-1}' + q_{n-1} p_{n-1}',
         w_n = q_{n-1} q_{n-1}'.

Hence

(6.10)    u_n → α^2,    2v_n → 2αβ,    w_n → β^2.

5 In the terminology introduced in footnote 3 we can interpret p_n and q_n as the
expected values of the gene frequencies in the nth female generation. With this
interpretation the formulas for p_n and q_n are no longer approximations but exact.

These formulas show that there is a strong systematic tendency,
from generation to generation, towards a state where the genotypes
A and a appear among males with frequencies $\alpha$ and $\beta$, and the female
genotypes AA, Aa, aa have probabilities $\alpha^2$, $2\alpha\beta$, $\beta^2$, respectively. The
convergence is very fast, as indicated by (6.7). In practice, equilibrium
will be reached after three or four generations. To be sure, small
chance fluctuations will be superimposed on the described changes,
but the latter represent the prevailing systematic tendency.

Our main conclusion is that under random mating we can expect the
sex-linked genotypes A and a among males, and AA, Aa, aa among
females to occur approximately with the frequencies $\alpha$, $\beta$, $\alpha^2$, $2\alpha\beta$, $\beta^2$,
respectively, where $\alpha + \beta = 1$.
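The speed of convergence claimed for (6.7) can be checked numerically. The following is a minimal sketch, not part of the original text: it iterates the recursion $p_n = (p_{n-1} + p'_{n-1})/2$, $p_n' = p_{n-1}$ and compares the result with the closed form in (6.7). The function names and the starting frequencies 0.9 and 0.3 are illustrative assumptions.

```python
def iterate_sex_linked(p, p_male, generations):
    """Iterate p_n = (p_{n-1} + p'_{n-1}) / 2 and p'_n = p_{n-1}.

    p is the A-gene frequency among females, p_male among males.
    Returns the list of (p_n, p'_n) for n = 0, 1, ..., generations.
    """
    history = [(p, p_male)]
    for _ in range(generations):
        # The right-hand side is evaluated with the old values of both variables.
        p, p_male = (p + p_male) / 2, p
        history.append((p, p_male))
    return history


def closed_form(p, p_male, n):
    """Right-hand side of (6.7): alpha + (-1)^n (p - p') / (3 * 2^n)."""
    alpha = (2 * p + p_male) / 3
    return alpha + (-1) ** n * (p - p_male) / (3 * 2 ** n)


if __name__ == "__main__":
    p0, p0_male = 0.9, 0.3  # assumed starting frequencies, for illustration only
    for n, (pn, pn_male) in enumerate(iterate_sex_linked(p0, p0_male, 6)):
        print(n, round(pn, 6), round(pn_male, 6), round(closed_form(p0, p0_male, n), 6))
```

With these starting values the deviation from $\alpha$ is halved in absolute value and changes sign in each generation, which is the behavior described above.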

Application. Many sex-linked genes, like colorblindness, are recessive 
and cause defects. Let a be such a gene. Then all a-males and all 
aa-females show the defect. Females of Aa-type may transmit the 
defect to their offspring, but are not themselves affected. Hence we 
expect that a recessive sex-linked defect which occurs among males with
frequency $\alpha$ occurs among females with frequency $\alpha^2$. If one man in 100
is colorblind, one woman in 10,000 should be affected.





* 7. Selection 

As a typical example of the influence of selection we shall investigate 
the case where aa-individuals cannot multiply. This happens when 
the a-gene is recessive and lethal, so that aa-individuals are born but 
cannot survive. Another case occurs when artificial interference by 
breeding or laws prohibits mating of aa-individuals. 

Assume random mating among AA- and Aa-individuals, but no
mating of aa-types. Let the frequencies with which the genotypes
AA, Aa, aa appear in the total population be u, 2v, w. The corresponding
frequencies for parents are then

(7.1)  $u^* = \dfrac{u}{1 - w}, \qquad 2v^* = \dfrac{2v}{1 - w}, \qquad w^* = 0.$

We can proceed as in section 5, but have to use the quantities (7.1)
instead of u, 2v, w. Hence, (5.1) is to be replaced by


(7.2)  $p = \dfrac{u + v}{1 - w}, \qquad q = \dfrac{v}{1 - w}.$

The probabilities of the three genotypes in the first filial generation are
again given by (5.2) or $u_1 = p^2$, $2v_1 = 2pq$, $w_1 = q^2$.

As before, in order to investigate the systematic changes from
generation to generation, we have to replace u, v, w by $u_1, v_1, w_1$ and
thus obtain probabilities $u_2, v_2, w_2$ for the second descendant generation,
etc. In general we get from (7.2)

(7.3)  $p_n = \dfrac{u_n + v_n}{1 - w_n}, \qquad q_n = \dfrac{v_n}{1 - w_n}$

and

(7.4)  $u_{n+1} = p_n^2, \qquad 2v_{n+1} = 2p_n q_n, \qquad w_{n+1} = q_n^2.$

A comparison of (7.3) and (7.4) shows that

(7.5)  $p_{n+1} = \dfrac{u_{n+1} + v_{n+1}}{1 - w_{n+1}} = \dfrac{p_n}{1 - q_n^2} = \dfrac{1}{1 + q_n}$

and similarly

(7.6)  $q_{n+1} = \dfrac{v_{n+1}}{1 - w_{n+1}} = \dfrac{q_n}{1 + q_n}.$

From (7.6) we can calculate $q_n$ explicitly. In fact

(7.7)  $\dfrac{1}{q_{n+1}} = 1 + \dfrac{1}{q_n},$


* Starred sections treat special topics and may be omitted at first reading. 





whence successively 

(7.8)  $\dfrac{1}{q_1} = 1 + \dfrac{1}{q}, \quad \dfrac{1}{q_2} = 2 + \dfrac{1}{q}, \quad \dfrac{1}{q_3} = 3 + \dfrac{1}{q}, \quad \ldots, \quad \dfrac{1}{q_n} = n + \dfrac{1}{q}$

or

(7.9)  $q_n = \dfrac{q}{1 + nq}, \qquad w_{n+1} = \left(\dfrac{q}{1 + nq}\right)^2.$

We see that the unproductive (or undesirable) genotype gradually
drops out, but the process is extremely slow. For q = 0.1 it takes ten
generations to reduce the frequency of a-genes by one-half; this reduces
the frequency of the aa-type approximately from 1 to 1/4 per cent.
(If a is sex-linked, the elimination proceeds much faster as shown in
problem 23; for a generalized selection scheme cf. problem 24. 6 )
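The slowness of the elimination is easy to see numerically. The sketch below (function names are illustrative assumptions, not from the text) iterates the recursion (7.6) and compares it with the closed form (7.9).

```python
def q_by_recursion(q, n):
    """Apply q_{k+1} = q_k / (1 + q_k), formula (7.6), n times starting from q."""
    for _ in range(n):
        q = q / (1 + q)
    return q


def q_closed_form(q, n):
    """Formula (7.9): q_n = q / (1 + n q)."""
    return q / (1 + n * q)


if __name__ == "__main__":
    q0 = 0.1
    for n in (0, 1, 5, 10, 20, 50):
        qn = q_by_recursion(q0, n)
        # qn**2 is the frequency of the aa-type in the following generation.
        print(n, round(qn, 6), round(q_closed_form(q0, n), 6), round(qn ** 2, 6))
```

For q = 0.1 the gene frequency after ten generations is 0.05, so the aa-frequency falls from 0.01 to 0.0025, in agreement with the figures quoted in the text.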

8. Problems for Solution 

1. Three dice are rolled. If no two show the same face, what is the probability 
that at least one is an ace? 

2. Given that a throw with 10 dice produced at least one ace, what is the 
probability p of two or more aces? 

3. Bridge. In a bridge party West has no ace. What probability should be
attributed to the event of his partner having two or more aces? [Cf. example (1.c).]

4. Bridge. North and South have ten trumps between them (trumps being
cards of a specified suit). Find the probability that all three remaining trumps 
are in the same hand (either East or West has no trumps). 

5. Continuation. If it is known that the king of trumps is included among 
the three, what is the probability that he is “unguarded” (that is, one player has 
the king, the other the remaining two trumps)? 

6. In a bolt factory machines A, B, C manufacture, respectively, 25, 35, and 
40 per cent of the total. Of their output 5, 4, and 2 per cent are defective bolts. 
A bolt is drawn at random from the produce and is found defective. What are the 
probabilities that it was manufactured by machines A, B, C? 

7. In example (l.of) suppose that an even number n of shots was fired. What 
is the probability that n = 2? 

8. Suppose that 5 men out of 100 and 25 women out of 10,000 are colorblind. 
A colorblind person is chosen at random. What is the probability of his being 
male? (Assume males and females to be in equal numbers.) 

9. Let 7 the probability $p_n$ that a family has exactly n children be $\alpha p^n$ when
$n \ge 1$, and $p_0 = 1 - \alpha p(1 + p + p^2 + \cdots)$. Suppose that all sex distributions
of n children have the same probability. Show that for $k \ge 1$ the probability

6 For a further analysis of various eugenic effects (which are frequently different 
from the ideas of enthusiastic proponents of sterilization laws) cf. G. Dahlberg, 
Mathematical Methods for Population Genetics , New York and Basel, 1948. 

7 According to A. J. Lotka, American family statistics satisfies our hypothesis
with p = 0.7358. Cf. Théorie analytique des associations biologiques II,
Actualités scientifiques et industrielles, no. 780, Paris, 1939.



that a family contains exactly k boys is

$\dfrac{2\alpha p^k}{(2 - p)^{k+1}}.$

10. Continuation. Given that a family includes at least one boy, what is the 
probability that there are two or more? 

11. Let the events $A_1, A_2, \ldots, A_n$ be independent and $\Pr\{A_k\} = p_k$. Find the
probability p that none of the events occur.

12. Continuation. Show that always $p < e^{-\sum p_k}$.

13. Continuation. From Bonferroni's inequality (6.7 of Chapter 4) deduce that
the probability of k or more of the events $A_1, \ldots, A_n$ occurring simultaneously is
less than $(p_1 + \cdots + p_n)^k / k!$.

14. To example (2.d). If the die turns up a red face n times in succession, the
probability that it is die A is

$\dfrac{1}{1 + (1/2)^n}.$

15. To Polya's urn scheme, example (2.c). Given that the second ball was black,
what is the probability that the first was black?

16. To Polya's urn scheme, example (2.c). Show by induction that the probability
of a black ball at any trial is b/(b + r).

17. Continuation. Prove by induction: for any m < n the probabilities that
the mth and the nth drawings produce the combinations (b, b) or (b, r) are

$\dfrac{b(b + c)}{(b + r)(b + r + c)}, \qquad \dfrac{br}{(b + r)(b + r + c)}.$

Generalize to more than two drawings. 

18. Time symmetry of Polya's scheme. Let A and B stand either for black or red 
(so that AB can be any of the four combinations). Show that the probability of 
A at the nth drawing, given that the mth drawing yields B , is the same as the 
probability of A at the mth drawing when the nth drawing yields B. 

19. In the Polya scheme let $p_k(n)$ be the probability of k black balls in the first
n drawings. Prove the recurrence relation

$p_k(n + 1) = p_k(n)\, \dfrac{r + (n - k)c}{b + r + nc} + p_{k-1}(n)\, \dfrac{b + (k - 1)c}{b + r + nc},$

where $p_{-1}(n)$ is to be interpreted as 0. Use this relation for a new proof of (2.5).


Applications in Biology 

20. Under random mating less than half the population belongs to genotype Aa. 

21. Generalize the results of section 5 to the case where each gene can have any
of the forms $A_1, A_2, \ldots, A_k$, so that there are k(k + 1)/2 genotypes instead of three
(multiple alleles).

22. Brother-sister mating. Two parents are selected at random from a population
in which the genotypes AA, Aa, aa occur with frequencies u, 2v, w. This process
is repeated in their progeny. Find the probabilities that both parents of the
first, second, third filial generation belong to AA [cf. Chapter 16, example (4.5)].





23. Selection. Let a be a recessive sex-linked gene, and suppose that a selection
process makes mating of a-males impossible. If the genotypes AA, Aa, aa appear
among females with frequencies u, 2v, w, show that for female descendants of the
first generation $u_1 = u + v$, $2v_1 = v + w$, $w_1 = 0$ and hence $p_1 = p + q/2$, $q_1 = q/2$.
That is to say, the frequency of the a-gene among females is reduced to one-half.

24. The selection problem of section 7 can be generalized by assuming that only
the fraction $\lambda$ ($0 < \lambda < 1$) of the aa-class is eliminated. Show that

$p = \dfrac{u + v}{1 - \lambda w}, \qquad q = \dfrac{v + (1 - \lambda)w}{1 - \lambda w}.$

More generally, (7.3) is to be replaced by

$p_{n+1} = \dfrac{p_n}{1 - \lambda q_n^2}, \qquad q_{n+1} = q_n\, \dfrac{1 - \lambda q_n}{1 - \lambda q_n^2}.$

(The general solution of these equations appears to be unknown.)

25. Consider simultaneously two pairs of genes with possible forms (A, a) and
(B, b), respectively. Any person transmits to each descendant one gene of each
pair, and we shall suppose that each of the four possible combinations has probability
1/4. (This is the case if the genes are on separate chromosomes; otherwise
there is strong dependence.) There are then nine genotypes, and we assume that
their frequencies in the parent population are $U_{AABB}$, $U_{aaBB}$, $U_{AAbb}$, $U_{aabb}$, $2U_{AaBB}$,
$2U_{Aabb}$, $2U_{AABb}$, $2U_{aaBb}$, $4U_{AaBb}$. Put
$p_{AB} = U_{AABB} + U_{AABb} + U_{AaBB} + U_{AaBb}$,
$p_{Ab} = U_{AAbb} + U_{Aabb} + U_{AABb} + U_{AaBb}$,
$p_{aB} = U_{aaBB} + U_{aaBb} + U_{AaBB} + U_{AaBb}$,
$p_{ab} = U_{aabb} + U_{Aabb} + U_{aaBb} + U_{AaBb}$.
Compute the corresponding quantities for the first descendant generation. Show
that for it $p_{AB}^{(1)} = p_{AB} - \delta$, $p_{Ab}^{(1)} = p_{Ab} + \delta$, $p_{aB}^{(1)} = p_{aB} + \delta$,
$p_{ab}^{(1)} = p_{ab} - \delta$ with $2\delta = p_{AB}\, p_{ab} - p_{Ab}\, p_{aB}$.
The stable distribution is given by

$p_{AB} - 2\delta = p_{Ab} + 2\delta$, etc.

(Notice that Hardy's law does not apply: the composition changes from generation 
to generation.) 

26. Assume that the genotype frequencies in a population are $u = p^2$, $2v = 2pq$,
$w = q^2$. Given that a man is of genotype Aa, the probability that his brother is
of the same genotype is (1 + pq)/2.

Note: The following problems are on family relations and give a meaning to the notion
of degree of relationship, according to which brothers are as close to each other as
grandfather and grandson. Each problem is a continuation of the preceding one. Random
mating and the notations of section 5 are assumed. We are here concerned with a special
case of Markov chains (cf. Chapter 15). Matrix algebra simplifies the writing.

27. Number the genotypes AA, Aa, aa by 1, 2, 3, respectively, and let
$p_{ik}$ (i, k = 1, 2, 3) be the conditional probability that an offspring be of genotype
k if it is known that the male (or female) parent is of genotype i. Compute the
nine probabilities $p_{ik}$, assuming that the probabilities for the other parent to be
of genotype 1, 2, 3 are $p^2$, $2pq$, $q^2$, respectively.

28. Show that $p_{ik}$ is also the conditional probability that the parent is of genotype
k if it is known that a specified offspring is of genotype i.

29. Prove that the conditional probability of a grandson (grandfather) to be of
genotype k if it is known that the grandfather (grandson) is of genotype i is given
by

$p_{ik}^{(2)} = p_{i1} p_{1k} + p_{i2} p_{2k} + p_{i3} p_{3k}.$

[The matrix $(p_{ik}^{(2)})$ is the square of the matrix $(p_{ik})$.]

30. Show that $p_{ik}^{(2)}$ is also the conditional probability that a man is of genotype
k if it is known that a specified brother (not twin) is of genotype i.

31. Show that the conditional probability of a man to be of genotype k when it
is known that a specified great-grandfather (or great-grandson, or uncle, or nephew)
is of genotype i is given by

$p_{ik}^{(3)} = p_{i1}^{(2)} p_{1k} + p_{i2}^{(2)} p_{2k} + p_{i3}^{(2)} p_{3k} = p_{i1} p_{1k}^{(2)} + p_{i2} p_{2k}^{(2)} + p_{i3} p_{3k}^{(2)}.$

[The matrix $(p_{ik}^{(3)})$ is the third power of the matrix $(p_{ik})$. This procedure gives
a precise meaning to the notion of the degree of family relationship.]

32. More generally, define probabilities $p_{ik}^{(n)}$ that a descendant of nth generation
is of genotype k if a specified ancestor was of genotype i. Prove by induction
that the $p_{ik}^{(n)}$ are given by the elements of the following matrix

$\begin{pmatrix}
p^2 + pq/2^{n-1} & 2pq + q(q - p)/2^{n-1} & q^2 - q^2/2^{n-1} \\
p^2 + p(q - p)/2^{n} & 2pq + (1 - 4pq)/2^{n} & q^2 + q(p - q)/2^{n} \\
p^2 - p^2/2^{n-1} & 2pq + p(p - q)/2^{n-1} & q^2 + pq/2^{n-1}
\end{pmatrix}$

(This shows that the influence of an ancestor decreases from generation to generation
by the factor 1/2.)



CHAPTER 6 


THE BINOMIAL AND THE POISSON DISTRIBUTIONS 
1. Bernoulli Trials 1 

Repeated independent trials are called Bernoulli trials if there are only
two possible outcomes for each trial and their probabilities remain the same
throughout the trials. It is usual to denote the two probabilities by p
and q, and to refer to the outcome with probability p as “success,” S,
and to the other as “failure,” F. Clearly, p and q must be non-negative,
and

(1.1)  $p + q = 1.$

The sample space of each individual trial is formed by the two
points S and F. The sample space of n Bernoulli trials contains $2^n$
points or successions of n symbols S and F, each point representing one
possible outcome of the compound experiment. Since the trials are
independent, the probabilities multiply. In other words, the probability
of any specified sequence is the product obtained on replacing the symbols
S and F by p and q, respectively. Thus
$\Pr\{(SSFSF \cdots FFS)\} = ppqpq \cdots qqp$.
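The multiplication rule for a specified sequence can be sketched in a few lines. The function name and the encoding of an outcome as a string of the letters S and F are illustrative assumptions.

```python
def sequence_probability(sequence, p):
    """Probability of one specified sequence of Bernoulli outcomes.

    Each 'S' contributes a factor p and each 'F' a factor q = 1 - p.
    """
    q = 1 - p
    prob = 1.0
    for symbol in sequence:
        if symbol == "S":
            prob *= p
        elif symbol == "F":
            prob *= q
        else:
            raise ValueError(f"unexpected symbol {symbol!r}; use 'S' or 'F'")
    return prob


if __name__ == "__main__":
    # For a symmetric coin every sequence of length 5 has probability 1/32.
    print(sequence_probability("SSFSF", 0.5))
    # With p = 2/3 the probability depends only on the numbers of S and F.
    print(sequence_probability("SSF", 2 / 3), sequence_probability("FSS", 2 / 3))
```

Note that the result depends only on how many S and F the sequence contains, not on their order.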

Examples. The most familiar example of Bernoulli trials is provided 
by successive tosses of a true or symmetric coin; here p = q = 1/2. 
If the coin is unbalanced, we still assume that the successive tosses are 
independent so that we have a model of Bernoulli trials in which the 
probability p for success can have an arbitrary value. If four faces 
of a “good” die are red and two black, then we have only two distinguishable
outcomes, to which for obvious reasons we attribute
probabilities p = 2/3 and q = 1/3. Often there are several possible
outcomes, but we have no interest in distinguishing among them and
prefer to describe any result as a simple alternative. Thus with good
dice the distinction between ace (S) and non-ace (F) leads to Bernoulli
trials with p = 1/6, whereas distinguishing between even or odd leads
to Bernoulli trials with p = 1/2. If the die is unbalanced, the successive
throws still form Bernoulli trials, but the corresponding probabilities
p are different. A royal flush in poker or double ace in rolling 2
dice may represent success; calling all other outcomes failure, we have
Bernoulli trials with p = 1/649,740 and p = 1/36 respectively. Reductions
of this type are usual in statistical applications. For example,
washers produced in mass production may vary continuously in thickness,
but, on inspection, they are classified as conforming (S) or defective
(F) according as their thickness is, or is not, within prescribed
limits.

1 James Bernoulli (1654-1705). His main work, the Ars Conjectandi, was
published in 1713.

The Bernoulli scheme of trials is a theoretical model, and only 
experience can show whether it is suitable for the description of specified 
physical experiments. Our knowledge that successive tossings of a 
coin conform to the Bernoulli scheme is derived from experimental 
evidence. The man in the street, and also the philosopher K. Marbe, 2 
believe that after a run of seventeen heads tail becomes more probable. 
This argument has nothing to do with imperfections of physical coins; 
it endows nature with memory, or, in our terminology, it denies the 
statistical independence of successive trials. Marbe’s theory cannot 
be refuted by logic but has been rejected because of lack of empirical 
support. 

In sampling practice, industrial quality control, etc., the scheme of 
Bernoulli trials provides an ideal standard even though it can never 
be fully attained. Thus, in the above example of the production of 
washers, there are many reasons why the output cannot conform to 
the Bernoulli scheme. The machines are subject to changes, and hence 
the probabilities do not remain constant; there is a persistence in the 
action of machines, and therefore long runs of deviations of like kind 
are more probable than they would be if the trials were truly independent.
From the point of view of quality control, however, it is desirable
that the process conform to the Bernoulli scheme, and it is an important 
discovery that, within certain limits, production can be made to behave 
in that way. The purpose of continuous control is then to discover at 
an early stage flagrant departures from the ideal scheme and to use 
them as an indication of impending trouble. 

2. The Binomial Distribution 

Frequently we are interested only in the total number of successes 
produced in a succession of n Bernoulli trials but not in their order. 
The number of successes can be 0, 1, • • •, n, and our first problem is 

2 Die Gleichförmigkeit in der Welt, Munich, 1916. There exists a huge critical
literature on Marbe's theory.





to determine the corresponding probabilities. Now the event “n trials
result in k successes and n - k failures” can happen in as many ways
as k letters S can be distributed among n places. In other words, our
event contains $\binom{n}{k}$ points, and, by definition, each point has the
probability $p^k q^{n-k}$. This proves the

Theorem. Let b(k; n, p) be the probability that n Bernoulli trials with
probabilities p for success and q = 1 - p for failure result in k successes
and n - k failures ($0 \le k \le n$). Then

(2.1)  $b(k; n, p) = \binom{n}{k} p^k q^{n-k}.$

In particular, the probability of no success is $q^n$, and the probability
of at least one success is $1 - q^n$.

The number of successes in n Bernoulli trials will later be considered
as a special random variable (cf. Chapter 9). In the general terminology,
the function (2.1) is the “distribution” of this random variable,
and we shall refer to it as the binomial distribution. The attribute
“binomial” refers to the fact that (2.1) represents the kth term of the
binomial expansion of $(p + q)^n$. This remark shows also that
$b(0; n, p) + b(1; n, p) + \cdots + b(n; n, p) = (p + q)^n = 1$, as is required
by the notion of probability. Formula (2.1) is sometimes called
Newton's theorem.
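Formula (2.1) translates directly into code. The following minimal sketch uses Python's standard `math.comb` for the binomial coefficient; the function name `b` simply mirrors the notation b(k; n, p) of the text.

```python
from math import comb


def b(k, n, p):
    """Binomial probability b(k; n, p) of formula (2.1)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


if __name__ == "__main__":
    n, p = 10, 0.3
    terms = [b(k, n, p) for k in range(n + 1)]
    # The terms are those of the expansion of (p + q)^n and so add up to 1.
    print(sum(terms))
    # Probability of no success and of at least one success.
    print(terms[0], 1 - terms[0])
```

The sum of the terms equals 1 up to floating-point rounding, as the binomial expansion requires.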

Examples. (a) The probabilities of 0, 1, ..., 6 heads in 6 tosses of a
symmetric coin are 1/64, 6/64, 15/64, 20/64, 15/64, 6/64, 1/64,
respectively. These are also the probabilities of having 0, 1, ..., 6
odd digits among 6 random digits. Again, the same figures can, in a
rough approximation, be interpreted as probabilities that among 6
children 0, 1, ..., 6 will be girls.

(b) In 10 throws of an ideal die the probability of exactly one ace
is $10(1/6)(5/6)^9 = 0.323011\ldots$. The probability of at least one ace is
$1 - (5/6)^{10} = 0.838494\ldots$. The probability of more than one ace
is the difference, or 0.515483....
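The figures in examples (a) and (b) can be reproduced as follows. This is only a sketch; exact fractions are used for the coin so that the values n/64 appear without rounding, and the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

# Example (a): 6 tosses of a symmetric coin, computed exactly.
coin = [Fraction(comb(6, k), 2 ** 6) for k in range(7)]

# Example (b): 10 throws of an ideal die, an ace counting as success.
p, q = 1 / 6, 5 / 6
exactly_one = 10 * p * q ** 9
at_least_one = 1 - q ** 10
more_than_one = at_least_one - exactly_one

if __name__ == "__main__":
    print([str(x) for x in coin])  # fractions are printed in lowest terms
    print(exactly_one, at_least_one, more_than_one)
```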

(c) Weldon's Dice Data. Let an experiment consist in throwing 12
dice and let us count fives and sixes as “success.” If the dice are perfect,
the probability of success is p = 1/3, and the number of successes
should follow the binomial distribution b(k; 12, 1/3). Table 1 gives
these probabilities, together with the corresponding observed average
frequencies in 26,306 actual experiments. The agreement looks good,
but for such extensive data it is really very bad. Statisticians usually
judge closeness of fit by the chi-square criterion. According to it,
deviations as large as those observed would happen with true dice
only four times out of 10,000. It is, therefore, reasonable to assume
that the dice were biased. A bias with probability of success p = 0.3377
fits the observations. 3


TABLE 1

Weldon's Dice Data

  k     b(k; 12, 1/3)     Observed Frequency     b(k; 12, 0.3377)

  0       0.007 707           0.007 033              0.007 123
  1        .046 244            .043 678               .043 584
  2        .127 171            .124 116               .122 225
  3        .211 952            .208 127               .207 736
  4        .238 446            .232 418               .238 324
  5        .190 757            .197 445               .194 429
  6        .111 275            .116 589               .115 660
  7        .047 689            .050 597               .050 549
  8        .014 903            .015 320               .016 109
  9        .003 312            .003 991               .003 650
 10        .000 497            .000 532               .000 558
 11        .000 045            .000 152               .000 052
 12        .000 002            .000 000               .000 002
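The two theoretical columns of Table 1 follow from (2.1) and can be regenerated with a short script. This is a sketch only; the observed frequencies come from Weldon's experiments and are of course not computed here.

```python
from math import comb


def b(k, n, p):
    """Binomial probability b(k; n, p) of formula (2.1)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def table_rows(n=12, p_fair=1 / 3, p_biased=0.3377):
    """Rows (k, b(k; n, p_fair), b(k; n, p_biased)) for k = 0, ..., n."""
    return [(k, b(k, n, p_fair), b(k, n, p_biased)) for k in range(n + 1)]


if __name__ == "__main__":
    for k, fair, biased in table_rows():
        print(f"{k:2d}  {fair:.6f}  {biased:.6f}")
```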


(d) We have encountered several special cases of the binomial distribution.
For example, the probability of the digit 7 appearing
among n random digits exactly k times is given by (2.1) with p = 1/10.
Table 1 of Chapter 3 gives the binomial distribution with n = 100,
p = 1/10, together with actual counts of the occurrence of the digit 7.
In Chapter 4, section 4, we applied the binomial distribution to a card-guessing
problem, and the columns $b_n$ of Table 1 give the terms of
the binomial distributions with n = 3, 4, 5, 6, 10 and p = 1/n. In
the problem of random distributions of objects in cells we found formula
(3.1) of Chapter 3, which is again a special case of the binomial distribution
for p = 1/n.

(e) If the probability of success is 0.01, how many trials are necessary
in order for the probability of at least one success to be 1/2 or more?
Here we seek the smallest integer n for which $1 - (0.99)^n \ge 1/2$, or
$-n \log(0.99) \ge \log 2$, that is, $n \ge \log 2 / (-\log 0.99) \approx 68.97$; therefore $n \ge 69$.
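A direct search is a simple check on the logarithmic bound in example (e). The sketch below (function names are illustrative assumptions) finds the smallest n both by stepping through the integers and by solving the inequality with logarithms.

```python
from math import ceil, log


def trials_needed_by_search(p, target=0.5):
    """Smallest n with 1 - (1 - p)^n >= target, found by stepping n upward."""
    n = 1
    while 1 - (1 - p) ** n < target:
        n += 1
    return n


def trials_needed_by_logs(p, target=0.5):
    """Same quantity from n >= log(1 - target) / log(1 - p)."""
    return ceil(log(1 - target) / log(1 - p))


if __name__ == "__main__":
    print(trials_needed_by_search(0.01), trials_needed_by_logs(0.01))
```

The two functions agree whenever the quotient of logarithms is not within rounding error of an integer.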

3 R. A. Fisher, Statistical Methods for Research Workers, Edinburgh-London,
1932, p. 66, or T. C. Fry, Probability and Its Engineering Uses, New York, 1928,
pp. 303ff.





(f) Power-supply Problems. Typical of many applications is the
following problem. A motor generates power to be used intermittently
by n workers. To get a crude idea of the load to be expected we
imagine that at any given time each worker has the same probability p
of requiring a unit of power. If they work independently, the probability
of exactly k workers requiring power at the same time should
be b(k; n, p). If, on the average, a worker uses power for 12 minutes
per hour, one would put p = 1/5. The binomial distribution is used
in practice to compute the probability of undesirable overloads or, 
what amounts to the same thing, to find an optimum number n of 
workers per motor. A finer analysis of the situation would require 
knowledge of the frequency of overloads, their average duration, etc. 
These problems necessitate the consideration of the process with regard 
to time dependence and lead to the theory of stochastic (= random) 
processes. 
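As a rough illustration of the overload computation, the sketch below sums the binomial tail beyond a given capacity. The figures n = 10 workers and the capacities tried are hypothetical; only p = 1/5 comes from the text.

```python
from math import comb


def overload_probability(n, p, capacity):
    """Probability that more than `capacity` of n workers need power at once.

    Workers are assumed independent, each needing power with probability p,
    so the number needing power has the distribution b(k; n, p).
    """
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(capacity + 1, n + 1))


if __name__ == "__main__":
    n, p = 10, 1 / 5  # n is a hypothetical figure; p = 1/5 as in the text
    for capacity in range(0, 8):
        print(capacity, round(overload_probability(n, p, capacity), 6))
```

Reading the printed column against a tolerated overload probability gives the capacity needed for n workers; scanning over n for a fixed capacity gives the number of workers one motor can serve.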

(g) Sampling. If a population consists of a red and b black elements,
then random sampling with replacement conforms to the Bernoulli
scheme, and the probability that a sample of size n will contain exactly
k red elements is given by the binomial distribution (2.1) with
p = a/(a + b). In sampling without replacement the corresponding
probability is given by the hypergeometric distribution

(2.2)  $\dfrac{\binom{a}{k}\binom{b}{n-k}}{\binom{a+b}{n}}$

[cf. (5.1) of Chapter 2]. When the population size a + b is large
compared to n, the probability (2.2) is close to b(k; n, a/(a + b)), so
that the binomial distribution applies approximately also to sampling
without replacement from large populations (cf. problem 42 of section
8 of Chapter 2). This fact is used in many ways. For example, in
industrial quality control, lots containing N items each are subjected 
to sampling inspection with the aim of eliminating all lots containing 
an excess of defective items. The probability of a sample of size n 
containing k defective items is given by (2.2), but this formula is 
cumbersome and is usually replaced by the binomial distribution, even 
though sampling without replacement is used. 
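The closeness of the two distributions for large populations can be seen numerically. In the sketch below the population sizes (3,000 red and 7,000 black elements) and the sample size 10 are arbitrary illustrative choices.

```python
from math import comb


def hypergeometric(k, n, a, b):
    """Probability of k red elements in a sample of n drawn without
    replacement from a red and b black elements."""
    return comb(a, k) * comb(b, n - k) / comb(a + b, n)


def binomial(k, n, p):
    """Binomial probability b(k; n, p) of formula (2.1)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


if __name__ == "__main__":
    a, b, n = 3000, 7000, 10  # hypothetical population and sample sizes
    for k in range(n + 1):
        h = hypergeometric(k, n, a, b)
        print(k, round(h, 6), round(binomial(k, n, a / (a + b)), 6))
```

Shrinking the population while keeping a/(a + b) fixed makes the two columns drift apart, which shows why the approximation needs a + b large compared to n.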

(h) Banach's Match-box Problem. 4 A certain mathematician always
carries two match boxes; every time he wants a match, he selects a box
at random. Inevitably a moment occurs when, for the first time, he
finds a box empty. At that moment the other box may contain
r = 0, 1, 2, ... matches, and we wish to find the corresponding probabilities
$u_r$. Assume that initially each box contained N matches. The
event “when first box is found empty, the second contains r matches”
means that out of the first N + (N - r) matches drawn, N are from
the first box. This corresponds to N successes in 2N - r trials, and
therefore

(2.3)  $u_r = \binom{2N - r}{N} 2^{-2N + r}.$

[Cf. Table 2. The discussion of the problem is continued in example
(3.f) of Chapter 9.]

4 Communicated by H. Steinhaus.

TABLE 2

Probabilities (2.3)

  r       u_r          U_r           r       u_r          U_r

  0    0.079 589    0.079 589       15    0.023 171    0.917 941
  1     .079 589     .159 178       16     .019 081     .937 022
  2     .078 785     .237 963       17     .015 447     .952 469
  3     .077 177     .315 140       18     .012 283     .964 752
  4     .074 790     .389 931       19     .009 587     .974 338
  5     .071 674     .461 605       20     .007 338     .981 676
  6     .067 902     .529 506       21     .005 504     .987 180
  7     .063 568     .593 073       22     .004 041     .991 220
  8     .058 783     .651 855       23     .002 901     .994 121
  9     .053 671     .705 527       24     .002 034     .996 155
 10     .048 363     .753 890       25     .001 392     .997 547
 11     .042 989     .796 879       26     .000 928     .998 475
 12     .037 676     .834 555       27     .000 602     .999 077
 13     .032 538     .867 094       28     .000 379     .999 456
 14     .027 676     .894 770       29     .000 232     .999 688

$u_r$ is the probability that, at the moment the first match box is found empty,
the second contains exactly r matches, assuming that initially each box contained
50 matches. $U_r = u_0 + u_1 + \cdots + u_r$ is the corresponding probability of having
not more than r matches.
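The entries of Table 2 can be recomputed from the binomial expression for N successes in 2N - r trials with p = 1/2. The sketch below assumes N = 50, the value for which the first entry equals the probability 0.079589 of exactly 50 heads in 100 tosses; the function name is illustrative.

```python
from math import comb


def u(r, N):
    """Probability that the other box holds exactly r matches:
    N successes in 2N - r symmetric Bernoulli trials."""
    return comb(2 * N - r, N) / 2 ** (2 * N - r)


if __name__ == "__main__":
    N = 50
    cumulative = 0.0
    for r in range(30):
        cumulative += u(r, N)
        print(f"{r:2d}  {u(r, N):.6f}  {cumulative:.6f}")
```

The running total reproduces the column $U_r$, and summing over all r from 0 to N gives 1.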


3. The Central Term 

From (2.1) we see that

(3.1)  $\dfrac{b(k; n, p)}{b(k - 1; n, p)} = \dfrac{(n - k + 1)p}{kq} = 1 + \dfrac{(n + 1)p - k}{kq}.$

This ratio is greater than or less than one according as k is less than
or greater than (n + 1)p. Accordingly, the term b(k; n, p) is greater
than the preceding one if k < (n + 1)p and is smaller if k > (n + 1)p.
If (n + 1)p = m happens to be an integer, then b(m; n, p)
= b(m - 1; n, p). In general, if m is the largest integer not exceeding
(n + 1)p, then as k goes from 0 to n the terms b(k; n, p) first increase and
then decrease, reaching their greatest value for k = m. If (n + 1)p is
an integer, there are two equal terms. We shall call b(m; n, p) the
central term. Often m is called “the most probable number of successes,”
but it must be remembered that for large values of n all terms
b(k; n, p) are small so that even the most probable number of successes
has only a small probability. In 100 tossings of a true coin the most
probable number of heads is 50, but its probability is only 0.079589....

In the next chapter we shall find that the central term b(m; n, p) is
approximately $(2\pi npq)^{-1/2}$. There we shall also deduce simple approximations
to other terms of the binomial distribution and to their sums.
These approximations enable us to evaluate probabilities whose direct
computation is cumbersome, e.g., the probability that in 1000 tossings
of a coin the number of heads is between 485 and 508.
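The location and size of the central term can be checked by brute force. The following sketch (helper names are illustrative) computes m as the largest integer not exceeding (n + 1)p, confirms that it maximizes b(k; n, p) for the coin example, and compares the central term with $(2\pi npq)^{-1/2}$.

```python
from math import comb, floor, pi, sqrt


def b(k, n, p):
    """Binomial probability b(k; n, p) of formula (2.1)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def central_index(n, p):
    """Largest integer m not exceeding (n + 1) p."""
    return floor((n + 1) * p)


if __name__ == "__main__":
    n, p = 100, 0.5
    m = central_index(n, p)
    terms = [b(k, n, p) for k in range(n + 1)]
    print(m, terms[m], max(terms))
    print(1 / sqrt(2 * pi * n * p * (1 - p)))  # approximation to the central term
```

Floating-point evaluation of (n + 1)p can be off by one unit in the last place when the product is meant to be an exact integer, so for such borderline cases exact rational arithmetic would be the safer choice.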

4. The Poisson Approximation 5

In many applications we deal with Bernoulli trials where, comparatively
speaking, n is large and p is small, whereas the product

(4.1)  $\lambda = np$

is of moderate magnitude. In such cases it is convenient to use an
approximation formula to b(k; n, p) which is due to Poisson and which
we proceed to derive. We have $b(0; n, p) = (1 - p)^n$ or, substituting
from (4.1),

(4.2)  $b(0; n, p) = \left(1 - \dfrac{\lambda}{n}\right)^n.$

Passing to logarithms and using the Taylor expansion [formula (6.9)
of Chapter 2] we find

(4.3)  $\log b(0; n, p) = n \log\left(1 - \dfrac{\lambda}{n}\right) = -\lambda - \dfrac{\lambda^2}{2n} - \cdots,$

so that for large n,

(4.4)  $b(0; n, p) \approx e^{-\lambda}.$

5 Siméon D. Poisson (1781-1840). His book, Recherches sur la probabilité des
jugements en matière criminelle et en matière civile, précédées des règles générales du
calcul des probabilités, appeared in 1837.



Furthermore

(4.5)  $b(1; n, p) = \dfrac{np}{q}\, b(0; n, p) = \dfrac{\lambda}{1 - \lambda/n}\, b(0; n, p),$

and therefore

(4.6)  $b(1; n, p) \approx \lambda e^{-\lambda}.$

Similarly

(4.7)  $b(2; n, p) = \dfrac{n(n - 1)p^2}{2q^2}\, b(0; n, p) \approx \dfrac{\lambda^2}{2}\, e^{-\lambda}.$

Continuing in this way, it is seen that for sufficiently large n and any
fixed $\lambda$

(4.8)  $b(k; n, p) \approx \dfrac{\lambda^k}{k!}\, e^{-\lambda}.$

This is the famous Poisson approximation to the binomial distribution.
For any fixed $\lambda$ the error committed in (4.8) tends to zero with increasing
n. We shall see that the error is of the order of magnitude
$\lambda^2/n$ so that (4.8) is of practical use if $\lambda^2$ is small as compared to n.
Before discussing the error term we proceed to illustrate the use of
(4.8) and to calculate the error involved in it in a few special cases.
It is convenient to have a symbol for the right-hand member in (4.8),
and we shall put

(4.9)  $p(k; \lambda) = e^{-\lambda}\, \dfrac{\lambda^k}{k!}.$

With this notation $p(k; \lambda)$ should be an approximation to $b(k; n, \lambda/n)$
when n is sufficiently large.
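The quality of the approximation can be inspected by tabulating both sides of (4.8). The sketch below does this for $\lambda = 1$ and a few values of n; the function names are illustrative, and `poisson` implements the symbol $p(k; \lambda)$ of (4.9).

```python
from math import comb, exp, factorial


def b(k, n, p):
    """Binomial probability b(k; n, p) of formula (2.1)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson(k, lam):
    """Poisson expression p(k; lambda) = e^{-lambda} lambda^k / k!, formula (4.9)."""
    return exp(-lam) * lam ** k / factorial(k)


def max_error(n, lam, kmax=10):
    """Largest absolute difference between b(k; n, lam/n) and p(k; lam), k <= kmax."""
    return max(abs(b(k, n, lam / n) - poisson(k, lam)) for k in range(kmax + 1))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, max_error(n, 1.0))
    for k in range(5):
        print(k, round(b(k, 100, 0.01), 6), round(poisson(k, 1.0), 6))
```

The maximal error shrinks roughly in proportion to 1/n, in line with the order of magnitude $\lambda^2/n$ stated above.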

Examples. (a) The entries $p_m$ of the last column of Table 1 in
Chapter 4 give the values p(m; 1). In the preceding columns $b_m$ stands
for b(m; N, 1/N). The table enables us to compare the Poisson distribution
p(m; 1) with the binomial distributions with p = 1/n and
n = 3, 4, 5, 6, 10. It will be seen that the agreement is surprisingly
good despite the small values of n.

(b) Table 3 compares p(k; 1) to the binomial distribution with
n = 100, p = 1/100. It shows the approximation to be satisfactory
for many purposes. As an example take the occurrence of the combination
(7, 7) among 100 pairs of random digits, which should have
the binomial distribution b(k; 100, 1/100). The last column of Table 3
gives the actual counts in 100 batches of 100 pairs of random digits each
recorded in Table 2 of Chapter 2. To obtain relative frequencies all
entries of the last column should be divided by 100. These frequencies
agree reasonably with the theoretical probabilities. (As judged by the
$\chi^2$-criterion, chance fluctuations should, in about 75 out of 100 similar
cases, produce larger deviations of observed frequencies from the
theoretical probabilities.)


TABLE 3

An Example of the Poisson Approximation

  k     b(k; 100, 1/100)     p(k; 1)      N_k

  0        0.366 032        0.367 879      41
  1         .369 730         .367 879      34
  2         .184 865         .183 940      16
  3         .060 999         .061 313       8
  4         .014 942         .015 328       0
  5         .002 898         .003 066       1
  6         .000 463         .000 511       0
  7         .000 063         .000 073       0
  8         .000 007         .000 009       0
  9         .000 001         .000 001       0

The first columns illustrate the Poisson approximation to the binomial distribution.
The last column records the number of batches of 100 pairs of random digits
each from Table 2 of Chapter 2 in which the combination (7, 7) appears exactly
k times.


(c) Birthdays. What is the probability, p_k, that in a company of 
500 people exactly k will have birthdays on New Year's day? If the 
500 people are chosen at random, we may apply the scheme of 500 
Bernoulli trials with probability of success p = 1/365. Then 
p_0 = (364/365)^500 = 0.2537.... For the Poisson approximation we 
put λ = 500/365 = 1.3699.... Then p(0; λ) = 0.2541..., which 
involves an error only in the fourth decimal place. For k = 1, 2, ... the 
correct values of p_k as calculated from the binomial formula are 
p_1 = 0.3484..., p_2 = 0.2388..., p_3 = 0.1089..., p_4 = 0.0372..., 
p_5 = 0.0101..., p_6 = 0.0023.... The corresponding Poisson 
approximations are p(1; λ) = 0.3481..., p(2; λ) = 0.2385..., 
p(3; λ) = 0.1089..., p(4; λ) = 0.0373..., p(5; λ) = 0.0102..., 
p(6; λ) = 0.0023.... All errors are in the fourth decimal place. 
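
The comparison in the birthday example can be checked with a few lines 
of Python. This is only an illustrative sketch: the function names 
binomial_pmf and poisson_pmf are our own, and the numbers n = 500 and 
p = 1/365 are those of the example. 

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    """b(k; n, p): probability of exactly k successes in n Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """p(k; lam): the Poisson expression e^(-lam) lam^k / k!."""
    return exp(-lam) * lam**k / factorial(k)


n, p = 500, 1 / 365
lam = n * p  # 1.3699...

# One row per k: (k, exact binomial value, Poisson approximation).
rows = [(k, binomial_pmf(k, n, p), poisson_pmf(k, lam)) for k in range(7)]

for k, exact, approx in rows:
    print(f"k={k}  binomial={exact:.4f}  poisson={approx:.4f}")
```

Printing the rows side by side shows how small the discrepancies are for 
every k from 0 to 6. 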

(d) Defective Items. Suppose that screws are produced under 
statistical quality control so that it is legitimate to apply the 
Bernoulli scheme of trials. If the probability of a screw being 
defective is p = 0.015, then the probability that a box of 100 screws 
contains no defective is (0.985)^100 = 0.22061. The corresponding 
Poisson approximation is e^{-1.5} = 0.22313..., which should be close 
enough for most practical purposes. We now ask: how many screws should 
a box contain in order that the probability of finding at least 100 
conforming screws be 0.8 or better? If 100 + x is the required number, 
then x is a small integer. To apply the Poisson approximation for 
n = 100 + x trials we should put λ = np, but np is approximately 
100p = 1.5. We then require the smallest integer x for which 


(4.10)    e^{-1.5} \left\{ 1 + \frac{1.5}{1} + \cdots + \frac{(1.5)^x}{x!} \right\} \geq 0.8. 


In tables⁶ we find that for x = 1 the left side is approximately 0.56, 
while for x = 2 it is 0.809. Thus the Poisson approximation would 
lead to the conclusion that 102 screws are required. Since 0.809 is 
dangerously near the given threshold of 0.8, the number 103 is safer. 
Actually the probability of finding at least 100 conforming screws in 
a box of 102 is 

(0.985)^{102} + \binom{102}{1} (0.985)^{101} (0.015) + \binom{102}{2} (0.985)^{100} (0.015)^2 = 0.8022\ldots. 
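
The search for the smallest x can be sketched in Python as follows. The 
helper names (poisson_cdf, exact_prob, smallest_x) are hypothetical and 
chosen only for this illustration; the exact probability is the chance 
of at most x defectives among 100 + x screws. 

```python
from math import comb, exp, factorial

P_DEFECTIVE = 0.015
TARGET = 0.8


def poisson_cdf(x, lam):
    """Left side of (4.10): e^(-lam) * (1 + lam + ... + lam^x / x!)."""
    return exp(-lam) * sum(lam**j / factorial(j) for j in range(x + 1))


def exact_prob(x, q=P_DEFECTIVE):
    """Probability of at most x defectives in a box of 100 + x screws."""
    n = 100 + x
    return sum(comb(n, j) * q**j * (1 - q) ** (n - j) for j in range(x + 1))


def smallest_x(prob):
    """Smallest non-negative integer x with prob(x) >= TARGET."""
    x = 0
    while prob(x) < TARGET:
        x += 1
    return x


x_poisson = smallest_x(lambda x: poisson_cdf(x, 1.5))
x_exact = smallest_x(exact_prob)
print(x_poisson, x_exact, round(exact_prob(2), 4))
```

Both searches stop at x = 2, that is, at a box of 102 screws, in 
agreement with the text. 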

(e) Centenarians. At birth any particular person has a small chance 
of living 100 years, and in a large community the number of yearly 
births is large. Owing to wars, epidemics, etc., different lives are not 
statistically independent, but as a first approximation we may compare 
n births to n Bernoulli trials with death after 100 years as success. In a 
stable community, where neither size nor mortality rate changes 
appreciably, it is reasonable to expect that the frequency of years 
in which exactly k centenarians die is approximately p(k; λ), with λ 
depending on the size and health of the community. Records of 
Switzerland confirm this conclusion.⁷ 

(f) Misprints, Raisins, etc. If in printing a book there is a constant 
probability of any letter being misprinted, and if the conditions of 
printing remain unchanged, then we have as many Bernoulli trials as 
there are letters. The frequency of pages containing exactly k 
misprints will then be approximately p(k; λ), where λ is a characteristic 
of the printer. Occasional fatigue of the printer, difficult passages, 
etc., will increase the chances of errors and may produce clusters of 
misprints. Thus the Poisson formula may be used to discover radical 
departures from uniformity or from the state of statistical control. 
A similar argument applies in many cases. For example, if many 
raisins are distributed in the dough, we should expect that thorough 
mixing will result in the frequency of loaves with exactly k raisins 
being approximately p(k; λ), with λ a measure of the density of raisins 
in the dough. 

6 E. C. Molina, Poisson's Exponential Binomial Limit, New York, 1942. (These 
are tables giving p(k; λ) and p(k; λ) + p(k + 1; λ) + ... for k ranging from 0 to 
100.) 

7 E. J. Gumbel, Les centenaires, Aktuárské Vědy, Prague, vol. 7 (1937), pp. 1-8. 

(g) Poisson Approximation to the Hypergeometric Distribution. In 
the sampling problem (g) of section 2 we have used the binomial 
distribution as an approximation to the hypergeometric distribution 
(2.2). Now if the number a of red elements is small as compared to 
the number b of black elements, then the probability p = a/(a + b) 
of "success" is small, and we can in turn approximate b(k; n, p) by the 
Poisson expressions p(k; λ). In this case the ratio a/(a + b) is 
approximately the same as a/b, and we conclude therefore: if a is small 
as compared to b and λ = na/b is of moderate magnitude, then the 
hypergeometric distribution (2.2) can be approximated by the Poisson 
expressions p(k; λ). It is easy to verify this result directly by the 
method used to derive (4.8). 

As a specific example suppose that out of the crop of a grain field 
seeds are taken for the planting of a new field of equal size. A 
particular plant may be mother to k = 0, 1, 2, ... new plants, and we 
require the corresponding probabilities. If there are n plants in all, 
we may consider the seeds as a random sample of size n from the 
population of all seeds. Suppose that the given plant has a seeds, and 
that there are a + b seeds in all. Then formula (2.2) applies. If the 
number of plants is large, then a will be small as compared to b, and 
the probability that our plant is mother to exactly k plants is, 
approximately, p(k; λ) with λ = na/b. 
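
The grain-field example can be tried numerically. The sketch below 
assumes that (2.2) is the usual hypergeometric expression, and the 
figures (n = 1000 plants with a = 50 seeds each) are hypothetical 
values chosen only for illustration. 

```python
from math import comb, exp, factorial


def hypergeometric_pmf(k, n, a, b):
    """Probability of k red elements in a sample of n, drawn without
    replacement from a red and b black elements (assumed form of (2.2))."""
    return comb(a, k) * comb(b, n - k) / comb(a + b, n)


def poisson_pmf(k, lam):
    """p(k; lam) = e^(-lam) lam^k / k!."""
    return exp(-lam) * lam**k / factorial(k)


# Hypothetical field: n plants, each bearing a seeds, so a + b = n * a seeds.
n, a = 1000, 50
b = n * a - a
lam = n * a / b  # close to 1

exact = [hypergeometric_pmf(k, n, a, b) for k in range(a + 1)]
approx = [poisson_pmf(k, lam) for k in range(a + 1)]
max_diff = max(abs(e - p) for e, p in zip(exact, approx))
print(f"lambda = {lam:.4f}, largest discrepancy = {max_diff:.4f}")
```

With these figures a is small compared to b and λ is of moderate 
magnitude, which is the situation described in the text. 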


Estimate of the Error. In practice we are not so much interested in the absolute 
error | b(k; n, λ/n) - p(k; λ) | as in the relative error, which is measured by 
the ratio 

(4.11)    \frac{b(k; n, \lambda/n)}{p(k; \lambda)}. 

Now, using (2.1) and (4.9), it is readily seen that 

(4.12)    \frac{b(k; n, \lambda/n)}{p(k; \lambda)} = \left(1 - \frac{1}{n}\right) \left(1 - \frac{2}{n}\right) \cdots \left(1 - \frac{k-1}{n}\right) \left(1 - \frac{\lambda}{n}\right)^{n-k} e^{\lambda}. 

To estimate the product to the right we pass to logarithms and use the fact that 
all terms in the Taylor expansion [cf. formula (6.9) of Chapter 2] of log (1 - x) are 
negative. It follows that for 0 < x < 1 

(4.13)    \log (1 - x) < -x - \frac{x^2}{2} < -x, 

and also 

(4.14)    \log (1 - x) > -x - \frac{x^2}{2(1 - x)} > -\frac{x}{1 - x}. 

Hence 

(4.15)    \log \frac{b(k; n, \lambda/n)}{p(k; \lambda)} < -\frac{1 + 2 + \cdots + (k-1)}{n} - (n - k) \left( \frac{\lambda}{n} + \frac{\lambda^2}{2n^2} \right) + \lambda 
          = -\frac{k(k-1)}{2n} - \frac{\lambda^2}{2n} + \frac{k\lambda}{n} + \frac{k\lambda^2}{2n^2} 
          = -\frac{(k - \lambda)^2}{2n} + \frac{k}{2n} + \frac{k\lambda^2}{2n^2}. 

On the other hand, 

(4.16)    \log \frac{b(k; n, \lambda/n)}{p(k; \lambda)} > -\frac{1 + 2 + \cdots + (k-1)}{n - k + 1} + \lambda - n \left( \frac{\lambda}{n} + \frac{\lambda^2}{2n(n - \lambda)} \right) + \frac{k\lambda}{n} 
          \geq -\frac{(k - \lambda)^2}{2(n - k + 1)} + \frac{(k - \lambda - 1)\lambda^2}{2(n - \lambda)(n - k + 1)} - \frac{k(k-1)\lambda}{n(n - k + 1)}. 

It follows that if λ is restricted to a finite interval the logarithm of the ratio 
(4.11), and hence the relative error, tends uniformly to zero as (k - λ)²/n → 0. 
The relative error can be appreciable only if (k - λ)²/n is not small. However, 
p(k; λ) is largest when k is near λ, and when k > 2λ the terms p(k; λ) decrease 
faster than a geometric series with ratio 1/2. Also, it follows from (4.15) that 
for k > 2λ + 1 the Poisson approximation overestimates b(k; n, λ/n). Hence, if 
(k - λ)²/n is not small, both p(k; λ) and b(k; n, λ/n) are negligible, and we 
conclude that in every finite λ-interval as n → ∞ the difference 
b(k; n, λ/n) - p(k; λ) tends to zero uniformly for all k. 
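
The uniform convergence can be observed numerically. The following 
Python sketch is an illustration only: it evaluates the logarithm of 
the ratio b(k; n, λ/n)/p(k; λ) from the product formula (one minus j/n 
for j = 1, ..., k - 1, times (1 - λ/n)^(n-k) e^λ), compares it with the 
upper bound -(k - λ)²/2n + k/2n + kλ²/2n², and records the largest 
absolute difference for several n. The value λ = 2 and the cut-off 
k ≤ 40 are arbitrary choices of ours. 

```python
from math import exp, factorial, log1p


def poisson_pmf(k, lam):
    """p(k; lam) = e^(-lam) lam^k / k!."""
    return exp(-lam) * lam**k / factorial(k)


def log_ratio(k, n, lam):
    """log of b(k; n, lam/n) / p(k; lam), from the product formula."""
    s = sum(log1p(-j / n) for j in range(1, k))
    return s + (n - k) * log1p(-lam / n) + lam


def upper_bound(k, n, lam):
    """-(k - lam)^2 / 2n + k / 2n + k lam^2 / 2n^2."""
    return -((k - lam) ** 2) / (2 * n) + k / (2 * n) + k * lam**2 / (2 * n**2)


lam = 2.0
max_abs_error = {}   # n -> max over k of |b(k; n, lam/n) - p(k; lam)|
bound_holds = True
for n in (10, 100, 1000, 10000):
    ks = range(0, min(n, 40) + 1)
    max_abs_error[n] = max(
        abs(poisson_pmf(k, lam) * (exp(log_ratio(k, n, lam)) - 1)) for k in ks
    )
    bound_holds = bound_holds and all(
        log_ratio(k, n, lam) < upper_bound(k, n, lam) for k in ks
    )

for n, err in max_abs_error.items():
    print(n, f"{err:.2e}")
```

The printed errors shrink roughly in proportion to 1/n. 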


5. The Poisson Distribution 

In the preceding section we have used the Poisson expression (4.9) 
merely as a convenient approximation to the binomial distribution in 
the case of large n and small p. In connection with the matching 
problem and the occupancy problem of Chapter 4 (sections 4 and 5) 
we have studied quite different probability distributions, which also 
led to the Poisson expressions p(k; λ) as a limiting form. We have here 
a special case of the remarkable fact that there exist a few distributions 
of great universality which occur in a surprisingly great variety of 
problems. The three principal distributions, with ramifications 
throughout probability theory, are the binomial distribution, the 
normal distribution (to be introduced in the following chapter), and 
the Poisson distribution 

(5.1)    p(k; \lambda) = e^{-\lambda} \frac{\lambda^k}{k!}, 

which we shall now consider on its own merits. 

We note first that on adding the equations (5.1) for k = 0, 1, 2, ... 
we get on the right side e^{-λ} times the Taylor series for e^{λ}. Hence 
for any fixed λ the quantities p(k; λ) add to unity, and therefore it is 
possible to conceive of an ideal experiment in which p(k; λ) is the 
probability of exactly k successes. We shall now indicate why many 
physical experiments and statistical observations actually lead to such 
an interpretation of (5.1). The examples of the next section will 
illustrate the wide range and the importance of various applications 
of (5.1). The true nature of the Poisson distribution will become 
apparent only in connection with the theory of stochastic processes 
(cf. Chapter 17, where a new approach to the Poisson distribution is 
given). 

Consider a sequence of random events occurring in time, such as 
radioactive disintegrations or incoming calls at a telephone exchange. 
Each event is represented by a point on the time axis, and we are 
concerned with chance distributions of points. There exist many 
different types of such distributions, but their study belongs to the 
domain of continuous probabilities which we have postponed to the 
second volume. Here we shall be content to show that the simplest 
physical assumptions lead to p(k; λ) as the probability of finding exactly 
k points (events) within a fixed interval of specified length. Our 
methods are necessarily crude, and we shall return to the same problem 
with more adequate methods in Chapter 17. 

The physical assumptions which we want to express mathematically 
are that the conditions of the experiment remain constant in time, and 
that non-overlapping time intervals are statistically independent in the 
sense that information concerning the number of events in one interval 
reveals nothing about the other. The theory of probabilities in a 
continuum makes it possible to express these statements directly, but 
being restricted to discrete probabilities, we have to use an 
approximate finite model and pass to the limit. 

Imagine the unit time interval divided into a great number n of 
intervals, each of length 1/n. Either a particular subinterval is empty, 
or it contains at least one of our random points (or events), and we 
agree to call the two possibilities failure and success, respectively. 
The probability p_n of success must be the same for all n subintervals, 
since they have the same length. The assumed independence of 
non-overlapping intervals then implies that we have n Bernoulli trials, 
and the probability of exactly k successes is given by b(k; n, p_n). Now 
the number of successes is not necessarily the same as the number of 
random points, since a subinterval may contain several random points. 
However, it is natural to introduce the additional assumption that the 
probability of two or more random points during a very short time 
interval is in the limit negligible.⁸ In this case the probability of 
finding exactly k random points in the unit time interval is given by the 
limit of b(k; n, p_n) as n → ∞. Now when we divide each subinterval 
into two parts of equal length, we find that p_n = 2p_{2n} - p_{2n}²; this 
equation states that success in an interval of length 1/n means either 
success in the left half, or success in the right half, or in both. It 
follows that p_n < 2p_{2n}, and this suggests that np_n increases 
monotonically (which can be proved rigorously). If np_n → λ, then 
b(k; n, p_n) ~ b(k; n, λ/n) → p(k; λ), and we find (5.1) as the 
probability that there is a total of k random points contained in our 
unit interval. The assumption np_n → ∞ leads to no sensible result, as 
it would imply infinitely many random points even in the smallest 
interval. 

If, instead of the unit interval, we take an arbitrary interval of 
length t and again use a subdivision into intervals of length 1/n, then 
we have Bernoulli trials with the same probability p_n of success, but 
the number of trials is the integer nearest to nt rather than n. The 
passage to the limit is the same, but we get λt instead of λ. This leads 
us to consider 

(5.2)    p(k; \lambda t) = e^{-\lambda t} \frac{(\lambda t)^k}{k!} 

as the probability of finding exactly k points in a fixed interval of length t. 
In particular, the probability of no point in an interval of length t is 

(5.3)    p(0; \lambda t) = e^{-\lambda t}, 

and the probability of one or more points is therefore 1 - e^{-λt}. 
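
The limiting argument can be illustrated in Python. In the sketch 
below the success probability of a subinterval of length 1/n is taken 
to be 1 - e^{-λ/n}, the probability of one or more points in an 
interval of that length; the density λ = 3 and all function names are 
assumptions made for the illustration. The code checks the halving 
relation between p_n and p_2n, the monotone growth of np_n towards λ, 
and the closeness of b(k; n, p_n) to the Poisson expression for large n. 

```python
from math import comb, exp, expm1, factorial

LAM = 3.0  # hypothetical density of points per unit time


def p_success(n):
    """Probability that a subinterval of length 1/n contains a point."""
    return -expm1(-LAM / n)  # 1 - e^(-LAM/n), computed accurately


def binomial_pmf(k, n, p):
    """b(k; n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """p(k; lam)."""
    return exp(-lam) * lam**k / factorial(k)


sizes = [2**j for j in range(11)]  # n = 1, 2, 4, ..., 1024

# Success in an interval: success in the left half, the right half, or both.
halving_gap = max(
    abs(p_success(n) - (2 * p_success(2 * n) - p_success(2 * n) ** 2))
    for n in sizes
)

# n * p_n should increase towards LAM along the successive halvings.
n_times_p = [n * p_success(n) for n in sizes]

n = sizes[-1]
max_diff = max(
    abs(binomial_pmf(k, n, p_success(n)) - poisson_pmf(k, LAM))
    for k in range(21)
)
print(halving_gap, n_times_p[-1], max_diff)
```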

The parameter λ is a physical constant which determines the density 
of points on the t-axis. The larger λ, the smaller is the probability 
(5.3) of finding no point. Suppose that a physical experiment is 
repeated a great number N of times and that each time we count the 
number of events in an interval of fixed length t. Let N_k be the number 
of times that exactly k events are observed. Then 

(5.4)    N_0 + N_1 + N_2 + \cdots = N. 

The total number of points observed in the N experiments is 

(5.5)    N_1 + 2N_2 + 3N_3 + \cdots = T, 

and T/N is the average. Now if N is large, we expect that 

(5.6)    N_k \approx N p(k; \lambda t) 

(this lies at the root of all applications of probability and will be 
justified and made more precise by the law of large numbers in 
Chapter 10). Substituting from (5.6) into (5.5), we find 

(5.7)    T \approx N \{ p(1; \lambda t) + 2p(2; \lambda t) + 3p(3; \lambda t) + \cdots \} 
           = N e^{-\lambda t} \lambda t \left\{ 1 + \frac{\lambda t}{1} + \frac{(\lambda t)^2}{2!} + \cdots \right\} = N \lambda t 

and hence 

(5.8)    \lambda t \approx \frac{T}{N}. 

This relation gives us a means of estimating λ from observations and 
of comparing theory with experiments. The examples of the next 
section will illustrate this point. 

8 This assumption is implicit in the intuitive picture of isolated random points. 
However, it is necessary to exclude the possibility of our events appearing in 
doublets. For example, if the events are automobile accidents, then the 
probability of two events within a short time is negligible in comparison with the 
probability of one event. On the other hand, an accident is likely to involve two 
cars, and if the events mean "a car smashed" then they are likely to appear in 
pairs and our assumption does not apply. 

Spatial Distributions. We have considered the distribution of 
random events or points along the t-axis, but the same argument applies 
to the distribution of points in plane or space. Instead of intervals of 
length t we have domains of area or volume t, and the fundamental 
assumption is that the probability of finding k points in any specified 
domain depends only on the area or volume of the domain, but not 
on its shape. Otherwise we have the same assumptions as before: 
(1) if t is small, the probability of finding more than one point in a 
domain of volume t is small as compared to t; (2) non-overlapping 
domains are mutually independent. To find the probability that a 
domain of volume t contains exactly k random points we subdivide it 
into n subdomains and approximate the required probability by the 
probability of k successes in n trials. This means neglecting the 
possibility of finding more than one point in the same subdomain, but 
our assumption (1) implies that the error tends to zero as n → ∞. In the 
limit we get again the Poisson distribution (5.2). Stars in space, 
raisins in cake, weed seeds among grass seeds, flaws in materials, 
animal litters in fields are distributed in accordance with the Poisson 
law. For numerical examples cf. section 6, examples (b) and (e). 





6. Observations Fitting the Poisson Distribution⁹ 

(a) Radioactive Disintegrations. A radioactive substance emits 
α-particles, and the number of particles reaching a given portion of 
space during time t is the best-known example of random events obeying 
the Poisson law. Of course, the substance continues to decay, and in 
the long run the density of α-particles will decline. However, with 
radium it takes years before a decrease of matter can be detected; for 
relatively short periods the conditions must be considered constant, 
and we have an ideal realization of the hypotheses which led to the 
Poisson distribution. 

In a famous experiment¹⁰ a radioactive substance was observed 
during N = 2608 time intervals of 7.5 seconds each; the number of 
particles reaching a counter was obtained for each period. Table 4 
records the number N_k of periods with exactly k particles. The total 
number of particles is T = Σ k N_k = 10,094, the average T/N = 3.870. 
The theoretical values Np(k; 3.870) are seen to be rather close to the 
observed numbers N_k. To judge the closeness of fit, an estimate of 
the probable magnitude of chance fluctuations is required. 
Statisticians judge the closeness of fit by the χ²-criterion. Measuring 
by this standard one should expect that under ideal conditions about 
17 out of 100 comparable cases show worse agreement than exhibited in 
Table 4. 


TABLE 4 

Example (a): Radioactive Disintegrations 

    k         N_k      Np(k; 3.870) 

    0          57          54.399 
    1         203         210.523 
    2         383         407.361 
    3         525         525.496 
    4         532         508.418 
    5         408         393.515 
    6         273         253.817 
    7         139         140.325 
    8          45          67.882 
    9          27          29.189 
  k ≥ 10       16          17.075 

  Total      2608        2608.000 


9 The Poisson distribution has become known as the law of small numbers or of 
rare events. These are misnomers which proved detrimental to the realization of 
the fundamental role of the Poisson distribution. The following examples will 
show how misleading the two names are. 

10 Rutherford, Chadwick, and Ellis, Radiations from Radioactive Substances, 
Cambridge, 1920, p. 172. Table 4 and the χ²-estimate of the text are taken from 
H. Cramér, Mathematical Methods of Statistics, Upsala and Princeton, 1945, p. 436. 

(b) Flying-bomb Hits on London. As an example of a spatial 
distribution of random points consider the statistics of flying-bomb hits 
in the south of London during World War II. The entire area is divided 
into N = 576 small areas of t = 1/4 square kilometers each, and Table 5 
records the number N_k of areas with exactly k hits.¹¹ The total number 
of hits is T = Σ k N_k = 537, the average λt = T/N = 0.9323.... The 
fit of the Poisson distribution is surprisingly good: as judged by the 
χ²-criterion, under ideal conditions some 88 per cent of comparable 
observations should show a worse agreement. It is interesting to note 
that most people believed in a tendency of the points of impact to 
cluster. In this case we would have a higher frequency of areas with 
either many hits or no hit, and a deficiency in the intermediate classes. 
Table 5 indicates perfect randomness and homogeneity of the area; 
we have here an instructive illustration of the established fact that to 
the untrained eye randomness appears as regularity or tendency to 
cluster. 

TABLE 5 

Example (b): Flying-bomb Hits on London 

k                    0        1        2        3       4     5 and over 

N_k                 229      211       93       35      7         1 

Np(k; 0.9323)     226.74   211.39    98.54    30.62    7.14      1.57 
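
The theoretical row of the flying-bomb table can be recomputed from 
the observed counts. The Python sketch below is illustrative only: it 
takes the total of 537 hits in 576 areas from the example, estimates 
λt by T/N, and assigns to the last class ("5 and over") whatever 
remains after the classes k = 0, ..., 4. The variable names are ours. 

```python
from math import exp, factorial

observed = [229, 211, 93, 35, 7, 1]  # N_k for k = 0, 1, 2, 3, 4, "5 and over"
total_hits = 537                     # T

N = sum(observed)                    # number of small areas
lam_t = total_hits / N               # estimate of lambda * t by T / N

# Expected counts N * p(k; lam_t) for k = 0..4, then the remaining tail.
expected = [N * exp(-lam_t) * lam_t**k / factorial(k) for k in range(5)]
expected.append(N - sum(expected))

for k, (n_k, e_k) in enumerate(zip(observed, expected)):
    print(k, n_k, f"{e_k:.2f}")
```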


(c) Chromosome Interchanges in Cells. Irradiation by X-rays 
produces certain processes in organic cells which we call chromosome 
interchanges. As long as radiation continues, the probability of such 
interchanges remains constant, and, according to theory, the numbers 
N_k of cells with exactly k interchanges should follow a Poisson 
distribution. The theory is also able to predict the dependence of the 
parameter λ on the intensity of radiation, the temperature, etc., but 
we shall not enter into these details. Table 6 records the result of 
eleven different series of experiments.¹² These are arranged according 
to goodness of fit. The last column indicates the approximate 
percentage of ideal cases in which chance fluctuations would produce a 
worse agreement (as judged by the χ²-standard). It is difficult to 
imagine a better agreement between theory and observation. 


TABLE 6 

Example (c): Chromosome Interchanges Induced by X-ray Irradiation 

Experiment                       Cells with k Interchanges      Total    χ²-Level 
Number                           0         1        2      ≥3     N      in Per Cent 

  1     Observed N_k            753       266       49      5    1073       95 
        Np(k; 0.35508)          752.3     267.1     47.4    6.2 

  2     Observed N_k            434       195       44      9     682       85 
        Np(k; 0.45601)          432.3     197.1     44.9    7.7 

  3     Observed N_k            280        75       12      1     368       65 
        Np(k; 0.27717)          278.9      77.3     10.7    1.1 

  4     Observed N_k           2278       273       15      0    2566       65 
        Np(k; 0.11808)         2280.2     269.2     15.9    0.7 

  5     Observed N_k            593       143       20      3     759       45 
        Np(k; 0.25296)          589.4     149.1     18.8    1.7 

  6     Observed N_k            639       141       13      0     793       45 
        Np(k; 0.21059)          642.4     135.3     14.2    1.1 

  7     Observed N_k            359       109       13      1     482       40 
        Np(k; 0.28631)          362.0     103.6     14.9    1.5 

  8     Observed N_k            493       176       26      2     697       35 
        Np(k; 0.33572)          498.2     167.3     28.1    3.4 

  9     Observed N_k            793       339       62      5    1199       20 
        Np(k; 0.39867)          804.8     320.8     64.0    9.4 

 10     Observed N_k            579       254       47      3     883       20 
        Np(k; 0.40544)          588.7     238.7     48.4    7.2 

 11     Observed N_k            444       252       59      1     756        5 
        Np(k; 0.49339)          461.6     227.7     56.2   10.5 


11 The figures are taken from R. D. Clarke, An Application of the Poisson 
Distribution, Journal of the Institute of Actuaries, vol. 72 (1946), p. 48. 

12 D. G. Catcheside, D. E. Lea, and J. M. Thoday, Types of Chromosome 
Structural Change Induced by the Irradiation of Tradescantia Microspores, 
Journal of Genetics, vol. 47 (1945-46), pp. 113-136. Our table is Table IX of this 
paper, except that the χ²-levels were recomputed, using a single degree of freedom. 




(d) Connections to Wrong Number. Table 7 shows statistics of 
telephone connections to a wrong number.¹³ A total of N = 267 
numbers was observed; N_k indicates how many numbers had exactly 
k wrong connections. The Poisson distribution p(k; 8.74) shows again 
an excellent fit. (As judged by the χ²-criterion the deviations are 
near the median value.) In Thorndike's paper the reader will find 
other telephone statistics following the Poisson law. In some cases 
(as with party lines, calls from groups of coin-boxes, etc.) there is an 
obvious interdependence among the events, and the Poisson 
distribution no longer fits. 


TABLE 7 

Example (d): Connections to Wrong Number 

    k        N_k     Np(k; 8.74) 

   0-2         1         2.05 
    3          5         4.76 
    4         11        10.39 
    5         14        18.16 
    6         22        26.45 
    7         43        33.03 
    8         31        36.09 
    9         40        35.04 
   10         35        30.63 
   11         20        24.34 
   12         18        17.72 
   13         12        11.92 
   14          7         7.44 
   15          6         4.33 
  ≥ 16         2         4.65 

  Total      267       267.00 

(e) Bacteria and Blood Counts. Figure 3 reproduces a photograph 
of a Petri plate with bacterial colonies, which are visible under the 
microscope as dark spots. The plate is divided into small squares. 
Table 8 reproduces the observed numbers of squares with exactly k 
dark spots in eight experiments with as many different kinds of 
bacteria.¹⁴ The last column again indicates how many per cent of 
similar experiments conducted under ideal conditions must be expected 
to show a worse χ²-agreement. We have here a representative of an 
important practical application of the Poisson distribution to spatial 
distributions of random points. 


Figure 3. Bacteria on a Petri Plate. 


TABLE 8 

Example (e): Counts of Bacteria 

    k                  0       1       2       3       4       5      6      7     χ²-Level 

Observed N_k           5      19      26      26      21      13      8               97 
Poisson theor.        6.1    18.0    26.7    26.4    19.6    11.7    9.5 

Observed N_k          26      40      38      17       7                              66 
Poisson theor.       27.5    42.2    32.5    16.7     9.1 

Observed N_k          59      86      49      30      20                              26 
Poisson theor.       55.6    82.2    60.8    30.0    15.4 

Observed N_k          83     134     135     101      40      16      7               63 
Poisson theor.       75.0   144.5   139.4    89.7    43.3    16.7    7.4 

Observed N_k           8      16      18      15       9       7                      97 
Poisson theor.        6.8    16.2    19.2    15.1     9.0     6.7 

Observed N_k           7      11      11      11       7       8                      53 
Poisson theor.        3.9    10.4    13.7    12.0     7.9     7.1 

Observed N_k           3       7      14      21      20      19      7      9        85 
Poisson theor.        2.1     8.2    15.8    20.2    19.5    15.0    9.6    9.6 

Observed N_k          60      80      45      16       9                              78 
Poisson theor.       62.6    75.8    45.8    18.5     7.3 

The last entry in each row includes the figures for higher classes and should be 
labeled "k or more." 


13 The observations are taken from F. Thorndike, Applications of Poisson's 
Probability Summation, The Bell System Technical Journal, vol. 5 (1926), 
pp. 604-624. The paper contains a graphical analysis of 32 different statistics. 

14 The table is taken from J. Neyman, Lectures and Conferences on Mathematical 
Statistics (mimeographed), Dept. of Agriculture, Washington, 1938. The original 
(by T. Matuszewski, J. Supinska, and J. Neyman) appeared, together with related 
material, in Zentralblatt für Bakteriologie, Parasitenkunde und Infektionskrankheiten, 
II Abt., vol. 96 (1936). 



7. The Multinomial Distribution 

The binomial distribution can easily be generalized to the case of n 
repeated independent trials where each trial can have one of several 
outcomes. Denote the possible outcomes of each trial by E_1, ..., E_r, 
and suppose that the probability of the realization of E_i in each trial 
is p_i (i = 1, ..., r). For r = 2 we have Bernoulli trials; in general, 
the numbers p_i are subject only to the condition 

(7.1)    p_1 + \cdots + p_r = 1, \qquad p_i \geq 0. 

The result of n trials is a succession like E_3 E_1 E_2 .... The probability 
that in n trials E_1 occurs k_1 times, E_2 occurs k_2 times, etc., is 

(7.2)    \frac{n!}{k_1! \, k_2! \cdots k_r!} \, p_1^{k_1} p_2^{k_2} \cdots p_r^{k_r}; 

here the k_i are arbitrary non-negative integers subject to the obvious 
condition 

(7.3)    k_1 + k_2 + \cdots + k_r = n. 

If r = 2, then (7.2) reduces to the binomial distribution with p_1 = p, 
p_2 = q, k_1 = k, k_2 = n - k. The proof in the general case proceeds 
along the same lines, starting with formula (4.7) of Chapter 2. 

Formula (7.2) is called the multinomial distribution because the 
right-hand member is the general term of the multinomial expansion of 
(p_1 + ... + p_r)^n. Its main application is to sampling with 
replacement when the individuals are classified into more than two 
categories (e.g., according to professions). 

Examples. (a) In rolling 12 dice, what is the probability of getting 
each face twice? Here E_1, ..., E_6 represent the six faces, all k_i equal 
2, and all p_i = 1/6. Therefore, the answer is (12!) 2^{-6} 6^{-12} 
= 0.0034.... 
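
The multinomial probability is easy to evaluate mechanically. The 
Python function below is a minimal sketch (the name multinomial_pmf is 
our own) that computes n!/(k_1! ... k_r!) times the product of the 
p_i^{k_i}, and applies it to the 12 dice. 

```python
from math import factorial, prod


def multinomial_pmf(counts, probs):
    """Probability that outcome i occurs counts[i] times in sum(counts)
    independent trials, outcome i having probability probs[i]."""
    coefficient = factorial(sum(counts))
    for k in counts:
        coefficient //= factorial(k)  # exact: each partial quotient is an integer
    return coefficient * prod(p**k for p, k in zip(probs, counts))


# Twelve dice, each face exactly twice.
each_face_twice = multinomial_pmf([2] * 6, [1 / 6] * 6)
print(f"{each_face_twice:.4f}")
```

With two categories the same function returns the binomial 
probabilities b(k; n, p). 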

(b) The distribution of 800 random digits corresponds to a 
multinomial distribution with r = 10, n = 800, p_1 = ... = p_10 = 1/10. 
Among the 800 first decimals¹⁵ of e the frequencies of occurrence of the 
digits 0, 1, ..., 9 are: 74, 73, 83, 86, 79, 67, 78, 82, 84, 94. The 
corresponding frequencies for π are 74, 92, 83, 79, 80, 73, 77, 75, 76, 91. 
Of course, there is no logical reason why such expansions should have 
properties of random sampling digits. The fact is that the usual 
statistical tests would never raise the suspicion that the above figures 
were not obtained by performing 800 independent trials. 

15 The decimals of e are taken from Intermédiaire des recherches mathématiques, 
vol. 2 (1946), p. 112; those of π from Mathematical Tables and Other Aids to 
Computation, vol. 2 (1947), p. 245, and vol. 3 (1948), p. 19. 
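
One of the usual tests alluded to in example (b) is the χ²-test for 
equal digit frequencies. The Python sketch below is an illustration 
under our own choice of test: it computes the statistic, the sum of 
(observed - expected)²/expected with expected frequency 80 for each 
digit. For comparison, the 5 per cent point of the χ²-distribution with 
9 degrees of freedom is about 16.92 (a standard table value, not taken 
from the text). 

```python
digits_of_e = [74, 73, 83, 86, 79, 67, 78, 82, 84, 94]
digits_of_pi = [74, 92, 83, 79, 80, 73, 77, 75, 76, 91]


def chi_square_uniform(frequencies):
    """Chi-square statistic against equal expected frequencies."""
    expected = sum(frequencies) / len(frequencies)
    return sum((f - expected) ** 2 / expected for f in frequencies)


chi2_e = chi_square_uniform(digits_of_e)
chi2_pi = chi_square_uniform(digits_of_pi)
print(chi2_e, chi2_pi)
```

Both statistics lie far below 16.92, so neither set of frequencies 
would be judged suspicious by this test. 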





(c) Multiple Bernoulli Trials. Two sequences of Bernoulli trials 
with probabilities of success and failure p_1, q_1, and p_2, q_2, 
respectively, may be considered one compound experiment with four 
possible outcomes in each trial, namely, the combinations (S, S), (S, F), 
(F, S), (F, F). The assumption that the two original sequences are 
independent is translated into the statement that the probabilities of 
the four outcomes are p_1 p_2, p_1 q_2, q_1 p_2, q_1 q_2, respectively. If 
k_1, k_2, k_3, k_4 are four integers adding to n, the probability that in 
n trials SS will appear k_1 times, SF k_2 times, etc., is 

(7.4)    \frac{n!}{k_1! \, k_2! \, k_3! \, k_4!} \, p_1^{k_1 + k_2} q_1^{k_3 + k_4} p_2^{k_1 + k_3} q_2^{k_2 + k_4}. 

A special case occurs in sampling inspection. An item is conforming or 
defective with probabilities p and q. It may or may not be inspected 
with corresponding probabilities p' and q'. The decision of whether 
an item is inspected is made without knowledge of its quality, so that 
we have independent trials. [Cf. problems 30 and 31, and example 
(1.c) of Chapter 9.] 


8. Problems for Solution 

1. Assuming all sex distributions to be equally probable, what proportion of 
families with exactly six children should be expected to have three boys and three 
girls? 

2. A bridge player had no ace in three consecutive hands. Has he reason to 
complain of ill luck? 

3. How long has a series of random digits to be in order for the probability of 
the digit 7 appearing to be at least 9/10? 

4. How many independent bridge dealings are required in order for the 
probability of a preassigned player having four aces at least once to be 1/2 or 
better? Solve again for some player instead of a given one. 

5. If the probability of hitting a target is 1/5 and ten shots are fired 
independently, what is the probability of the target being hit at least twice? 

6. In problem 5, find the conditional probability of the target being hit at least 
twice, assuming that at least one hit is scored. 

7. Find the probability that a hand of 13 bridge cards selected at random 
contains exactly 2 red cards and compare it with the corresponding probability 
in Bernoulli trials with p = 1/2. (For a description of bridge cards cf. footnote 1, 
Chapter 1.) 

8. What is the probability that the birthdays of six people fall in 2 calendar 
months leaving exactly 10 months free? (Assume independence and that all 
months are equally probable.) 

9. In rolling 6 true dice, find the probability of obtaining (a) at least one, (b) 
exactly one, (c) exactly two, aces. Compare with the Poisson approximations. 

10. If there are on the average 1 per cent left-handers, estimate the chances of 
having at least four left-handers among 200 people. 




11. A book of 600 pages contains 600 misprints. Estimate the chances that a 
page contains at least three misprints. 

12. Colorblindness appears in 1 per cent of the people in a certain population. 
How large must a random sample (with replacements) be if the probability of its 
containing a colorblind person is to be 0.95 or more? 

13. In the preceding exercise, what is the probability that a sample of 100 will 
contain (a) no, (b) two or more, colorblind people? 

14. Estimate the number of raisins which a cookie should contain on the average 
if it is desired that the probability of a cookie to contain at least one raisin be 0.99 
or more. 

15. The probability of a royal flush in poker is p = 1/649,740. How large has 
n to be to render the probability of no royal flush in n games smaller than 
1/e ≈ 1/3? (Note: No calculations are necessary for the solution.) 


16. Two people toss a true coin n times each. Find the probability that they 
will score the same number of heads. 

17. In a sequence of Bernoulli trials with probability p for success, find the 
probability that a successes will occur before b failures. (Note: The issue is decided 
after at most a + b — 1 trials. This problem played a role in the classical theory 
of games in connection with the question of how to divide the pot when the game 
is interrupted at a moment when one player lacks a points to victory, the other b 
points.) 

18. Show that for the largest term of the multinomial distribution we have 
| kj — npj | < r for all j. 

19. Inequalities for the "tails" of the binomial distribution. Using (3.1), prove 


n 

( 8 . 1 ) b(k; n, p) + b(k + l;n,p) H-b b(n; n, p) < - - b(k; n, p) 

. _ . k — np 

for A; > np + 1, and 


(8.2) b(k ; n, p) + b(k — 1; n, p) H - h 6(0; n, p) < -- b(k ' n, p) 

np — k 

for k < np. 

(This theorem will be used in Chapter 7, section 2.) 

20. Further inequalities. As in section 3 let m be the central index. Prove for 
k > (n + 1)p 

(8.3)   $\frac{b(k; n, p)}{b(k-1; n, p)} < e^{-\frac{k-(n+1)p}{n}}$

and hence, by multiplication, 

(8.4)   $b(k; n, p) < 2\, e^{-\frac{(k+\frac{1}{2}-(n+1)p)^2}{2n}}\, b(m; n, p).$

21. Without further computations show that (8.4) holds also for k < np — 1. 

22. Verify the identity 

(8.5)   $\sum_{\nu=0}^{k} b(\nu; n_1, p)\, b(k-\nu; n_2, p) = b(k; n_1+n_2, p)$

and interpret it probabilistically. 

Hint: Use formula (5.7) of Chapter 2. Equation (8.5) is a special case of 
convolutions, to be taken up in Chapter 11; another example is (8.6). 

23. Verify the identity 

(8.6)   $\sum_{\nu=0}^{k} p(\nu; \lambda_1)\, p(k-\nu; \lambda_2) = p(k; \lambda_1+\lambda_2).$

24. Let 

(8.7)   $B(k; n, p) = \sum_{\nu=0}^{k} b(\nu; n, p)$

be the probability of at most k successes in n trials. Then 

(8.8)   $B(k; n+1, p) = B(k; n, p) - p\,b(k; n, p),$

        $B(k+1; n+1, p) = B(k; n, p) + q\,b(k+1; n, p).$

Verify this (a) from the definition, (b) analytically. 
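
Before attempting the proof, the recurrences (8.8) can be checked numerically. The following Python sketch does this for one arbitrarily chosen case; the names `b` and `B` mirror the notation of the text, and the values n = 12, p = 0.3 are assumptions made only for illustration.

```python
import math

def b(k: int, n: int, p: float) -> float:
    """Binomial probability b(k; n, p)."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def B(k: int, n: int, p: float) -> float:
    """Probability of at most k successes in n trials, as in (8.7)."""
    return sum(b(v, n, p) for v in range(k + 1))

n, p = 12, 0.3          # illustrative values, not taken from the text
q = 1.0 - p
worst = 0.0
for k in range(n):      # k + 1 <= n, so b(k + 1; n, p) is defined
    first = B(k, n + 1, p) - (B(k, n, p) - p * b(k, n, p))
    second = B(k + 1, n + 1, p) - (B(k, n, p) + q * b(k + 1, n, p))
    worst = max(worst, abs(first), abs(second))

print(f"largest discrepancy in (8.8): {worst:.2e}")
```

A discrepancy at the level of rounding error is of course no proof, but it guards against sign errors when the identities are copied.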

25. With the same notation 

(8.9)   $B(k; n, p) = (n-k)\binom{n}{k}\int_0^q t^{n-k-1}(1-t)^k\,dt$

and 

(8.10)   $1 - B(k; n, p) = n\binom{n-1}{k}\int_0^p t^k(1-t)^{n-k-1}\,dt.$

Hint: Integrate by parts or differentiate both sides with respect to p. Deduce 
one formula from the other. 

Note: The integral in (8.10) is the incomplete beta function. Tables of 
1 − B(k; n, p) to 7 decimals for k and n up to 50 and p = 0.01, 0.02, 0.03, ... are 
given in K. Pearson, Tables of the Incomplete Beta Function, London (Biometrika 
Office), 1934. 

26. Deduce the binomial distribution (a) by induction, (b) from the general 
summation formula (3.1) of Chapter 4. 

27. Prove $\sum k\,b(k; n, p) = np$, and $\sum k^2\,b(k; n, p) = n^2p^2 + npq$. 

28. Prove $\sum k^2\,p(k; \lambda) = \lambda^2 + \lambda$. 

29. Multiple Poisson distribution. When n is large and np_j = λ_j is moderate, 
the multinomial distribution (7.2) can be approximated by 

$e^{-(\lambda_1+\cdots+\lambda_r)}\,\frac{\lambda_1^{k_1}\lambda_2^{k_2}\cdots\lambda_r^{k_r}}{k_1!\,k_2!\cdots k_r!}.$

Prove also that the terms of this distribution add to unity. 

30. Multiple Bernoulli trials. In example (7.c) find the conditional probabilities 
p and q of (S, F) and (F, S), respectively, assuming that one of these combinations 
has occurred. Show that p > 1/2 or p < 1/2, according as p_1 > p_2 or p_2 > p_1. 

31. Continuation.¹⁶ If in n pairs of trials exactly m resulted in one of the 
combinations (S, F) or (F, S), show that the probability that (S, F) has occurred 
exactly k times is b(k; m, p). 

16 A. Wald, Sequential Tests of Statistical Hypotheses, Annals of Mathematical 
Statistics, vol. 16 (1945), p. 166. Wald uses the results given above to devise a 
practical method of comparing two empirically given sequences of trials (say, the 
output of two machines), with a view of selecting the one with the greater 
probability of success. Using the above result, he reduces this problem to the simpler 
one of finding whether in a sequence of Bernoulli trials the frequency of success 
falls significantly short of 1/2 or exceeds 1/2. 





32. Combination of the binomial and Poisson distributions. Suppose that the 
probability of an insect laying r eggs is p(r; λ) and that the probability of an egg 
developing is p. Assuming mutual independence of the eggs, show that the 
probability of a total of k survivors is given by the Poisson distribution with parameter 
λp. 

Note: Another example for the same situation: the probability of k chromosome 
breakages is p(k; λ), and the probability of a breakage healing is p. 

33. Polya’s scheme of contagion. The urn scheme [example (2.c) of Chapter 5] 
permits the following obvious generalization. Successive (dependent) trials can 
each result only in S or F. In the first trial the probabilities of S and F are p and 
q. If the first n trials (n = 1, 2, ...) resulted in k successes and n − k failures, the 
conditional probabilities of S and F in the (n + 1)st trial become (p + kγ)/(1 + nγ) 
and (q + (n − k)γ)/(1 + nγ). Here γ > −1 is a constant replacing c/(b + r) in 
the original scheme. Show that the probability of exactly k successes in the first n 
trials is given by 

(8.11)   $\pi(k; n) = \binom{n}{k}\,\frac{p(p+\gamma)(p+2\gamma)\cdots(p+k\gamma-\gamma)\;q(q+\gamma)(q+2\gamma)\cdots(q+n\gamma-k\gamma-\gamma)}{1(1+\gamma)(1+2\gamma)\cdots(1+n\gamma-\gamma)}.$


34. Continuation: Limiting form of Polya’s distribution. If n —> oo, p —> 0, 


0 so that np —* X, ny 


S then 




35. Continuation. As p 


oo the last expression tends to p(k; X). 



CHAPTER 7 


THE NORMAL APPROXIMATION TO THE 
BINOMIAL DISTRIBUTION 


1. The Normal Distribution 

In order to avoid later interruptions we pause here to introduce two 
functions of great importance. 

Definition. The function 

(1.1)   $\phi(x) = (2\pi)^{-1/2}\, e^{-x^2/2}$

is called the normal density function; its integral 

(1.2)   $\Phi(x) = \frac{1}{(2\pi)^{1/2}} \int_{-\infty}^{x} e^{-y^2/2}\,dy$

is the normal distribution function. 


The graph of φ(x) is the symmetric, bell-shaped curve shown in 
Figure 4. Note that different units are used along the two axes. The 
maximum of φ(x) is (2π)^{−1/2} = 0.399, approximately, so that in an 
ordinary Cartesian system the curve y = φ(x) would be much flatter. 

Lemma 1. The domain bounded by the graph of φ(x) and the x-axis 
has unit area, that is, 

(1.3)   $\int_{-\infty}^{+\infty} \phi(x)\,dx = 1.$


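Proof. We have 

(1.4)   $\left\{\int_{-\infty}^{+\infty} \phi(x)\,dx\right\}^2 = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} \phi(x)\phi(y)\,dx\,dy = \frac{1}{2\pi}\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} e^{-(x^2+y^2)/2}\,dx\,dy.$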





Figure 4. The Normal Density Function. 


This double integral can be expressed in polar coordinates thus: 

(1.5)   $\frac{1}{2\pi}\int_0^{2\pi} d\theta \int_0^{\infty} e^{-r^2/2}\, r\,dr = \int_0^{\infty} e^{-r^2/2}\, r\,dr = \left[-e^{-r^2/2}\right]_0^{\infty} = 1,$

which proves the assertion. 
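
As a numerical companion to Lemma 1, the following Python sketch approximates the area under φ by a midpoint rule. The integration range [−10, 10] and the step count are assumptions chosen for illustration; the part of the area outside that range is far below the rounding error.

```python
import math

def phi(x: float) -> float:
    """Normal density function (1.1)."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def integrate_phi(lo: float, hi: float, steps: int = 20000) -> float:
    """Midpoint-rule approximation to the integral of phi over [lo, hi]."""
    width = (hi - lo) / steps
    return width * sum(phi(lo + (i + 0.5) * width) for i in range(steps))

area = integrate_phi(-10.0, 10.0)
print(f"area under phi on [-10, 10]: {area:.10f}")
print(f"maximum of phi: {phi(0.0):.3f}")
```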

It follows from the definition and the lemma that the function Φ(x) 
increases steadily from 0 to 1. Its graph (Figure 5) is an S-shaped 
curve with 

(1.6)   $\Phi(-x) = 1 - \Phi(x).$

Table 1 gives the values¹ of Φ(x) for positive x, and from (1.6) we get 
Φ(−x). 

1 For larger tables cf. Tables of Probability Functions, Vol. 2, National Bureau of 
Standards, New York, 1942. There φ(x) and Φ(x) − Φ(−x) are given to 15 
decimals for x from 0 to 1 in steps of 0.0001 and for x > 1 in steps of 0.001. 

For many purposes it is convenient to have an elementary estimate 
of the “tail,” 1 − Φ(x), for large x. Such an estimate is given by 






Lemma 2.² As x → ∞ 

(1.7)   $1 - \Phi(x) \sim \frac{1}{(2\pi)^{1/2}\,x}\, e^{-x^2/2};$

more precisely, for every x > 0 the double inequality 

(1.8)   $\frac{1}{(2\pi)^{1/2}}\, e^{-x^2/2}\left(\frac{1}{x} - \frac{1}{x^3}\right) < 1 - \Phi(x) < \frac{1}{(2\pi)^{1/2}}\, e^{-x^2/2}\,\frac{1}{x}$

holds (cf. problem 1). 

Proof. By differentiation we may verify that 

(1.9)   $\frac{1}{(2\pi)^{1/2}\,x}\, e^{-x^2/2} = \frac{1}{(2\pi)^{1/2}} \int_x^{\infty} e^{-y^2/2}\left(1 + \frac{1}{y^2}\right) dy.$

The integrand on the right side is greater than the integrand of 

(1.10)   $1 - \Phi(x) = \frac{1}{(2\pi)^{1/2}} \int_x^{\infty} e^{-y^2/2}\,dy,$

which proves the second inequality in (1.8). The first inequality 
follows in the same way, using as new integrand $e^{-y^2/2}(1 - 3/y^4)$, 
which is smaller than $e^{-y^2/2}$. 

2 Here and in the sequel the sign ~ is used to indicate that the ratio of the two 
sides tends to one. 
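
The double inequality (1.8) is easy to examine numerically. The sketch below is an illustration only: it evaluates the tail 1 − Φ(x) through the complementary error function of Python's standard library (an implementation choice, not something used in the text) and compares it with both bounds at a few arbitrarily chosen points.

```python
import math

def phi(x: float) -> float:
    """Normal density function (1.1)."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def tail(x: float) -> float:
    """1 - Phi(x), computed via erfc to stay accurate for large x."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

for x in (0.5, 1.0, 2.0, 4.0, 8.0):      # sample points chosen for illustration
    lower = phi(x) * (1.0 / x - 1.0 / x**3)
    upper = phi(x) / x
    print(f"x={x:4.1f}  lower={lower: .6e}  tail={tail(x):.6e}  "
          f"upper={upper:.6e}  tail/upper={tail(x) / upper:.4f}")
```

The last column illustrates (1.7): the ratio of the tail to the upper bound approaches one as x grows.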







TABLE 1 

The Normal Distribution 

   t        φ(t)         Φ(t) 
  0.0    0.398 942    0.500 000 
  0.1     .396 952     .539 828 
  0.2     .391 043     .579 260 
  0.3     .381 388     .617 911 
  0.4     .368 270     .655 422 
  0.5     .352 065     .691 462 
  0.6     .333 225     .725 747 
  0.7     .312 254     .758 036 
  0.8     .289 692     .788 145 
  0.9     .266 085     .815 940 
  1.0     .241 971     .841 345 
  1.1     .217 852     .864 334 
  1.2     .194 186     .884 930 
  1.3     .171 369     .903 200 
  1.4     .149 727     .919 243 
  1.5     .129 518     .933 193 
  1.6     .110 921     .945 201 
  1.7     .094 049     .955 435 
  1.8     .078 950     .964 070 
  1.9     .065 616     .971 283 
  2.0     .053 991     .977 250 
  2.1     .043 984     .982 136 
  2.2     .035 475     .986 097 
  2.3     .028 327     .989 276 
  2.4     .022 395     .991 802 
  2.5     .017 528     .993 790 
  2.6     .013 583     .995 339 
  2.7     .010 421     .996 533 
  2.8     .007 915     .997 445 
  2.9     .005 953     .998 134 
  3.0     .004 432     .998 650 
  3.1     .003 267     .999 032 
  3.2     .002 384     .999 313 
  3.3     .001 723     .999 517 
  3.4     .001 232     .999 663 
  3.5     .000 873     .999 767 
  3.6     .000 612     .999 841 
  3.7     .000 425     .999 892 
  3.8     .000 292     .999 928 
  3.9     .000 199     .999 952 
  4.0     .000 134     .999 968 
  4.1     .000 089     .999 979 
  4.2     .000 059     .999 987 
  4.3     .000 039     .999 991 
  4.4     .000 025     .999 995 
  4.5     .000 016     .999 997 





Note on Terminology. The term distribution function is used in the mathematical 
literature for any never-decreasing function F(x) which tends to 0 as x → −∞, 
and to 1 as x → ∞. Statisticians currently prefer the term cumulative distribution 
function, but the adjective “cumulative” is redundant. A density function is a 
non-negative function f(x) whose integral, extended over the entire x-axis, is unity. 
The integral from −∞ to x of any density function is a distribution function. The 
older term frequency function is a synonym for density function. 

The normal distribution function is often called the Gaussian distribution, but 
it was used in probability theory earlier by DeMoivre and Laplace. If the origin 
and the unit of measurement are changed, then Φ(x) is transformed into 
Φ((x − a)/b); this function is called the normal distribution function with mean a 
and variance b² (or standard deviation |b|). The function 2Φ(x·2^{1/2}) − 1 is often 
called the error function. 
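
The relation between Φ and the error function gives a convenient way to compute Φ without tables. The following Python sketch relies on `math.erf` from the standard library, which is the error function in the sense just described; the helper name `normal_cdf` and the restriction to b > 0 are assumptions of this illustration.

```python
import math

def Phi(x: float) -> float:
    """Normal distribution function (1.2), expressed through erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_cdf(x: float, a: float, b: float) -> float:
    """Normal distribution function with mean a and standard deviation b > 0."""
    return Phi((x - a) / b)

# Spot checks against Table 1 and the symmetry relation (1.6).
for t in (0.0, 1.0, 2.0, 3.0):
    print(f"Phi({t}) = {Phi(t):.6f}   Phi(-{t}) = {Phi(-t):.6f}")

# 2*Phi(x*sqrt(2)) - 1 should reproduce erf(x).
x = 0.3
print(2.0 * Phi(x * math.sqrt(2.0)) - 1.0, math.erf(x))
```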

2. The DeMoivre-Laplace Limit Theorem 

Let S_n stand for the number of successes in n Bernoulli trials with 
probability p for success. Then b(k; n, p) is the probability of the 
event that S_n = k. In practice we are usually interested in the 
probability of the event that the number of successes lies between preassigned 
limits α and β. If α and β are integers and α < β, then this event is 
defined by the inequality α ≤ S_n ≤ β, and its probability is 

(2.1)   $\Pr\{\alpha \le S_n \le \beta\} = b(\alpha; n, p) + b(\alpha+1; n, p) + \cdots + b(\beta; n, p).$

This sum may involve many terms, and a direct evaluation is usually 
impractical. Fortunately, whenever n is large, the normal distribution 
function can be used to derive simple approximations to the 
probability (2.1). This discovery is due to DeMoivre³ and Laplace.⁴ We 
shall see that its importance goes far beyond the domain of numerical 
calculations. 

Our first aim is to derive an asymptotic formula for the individual 
terms 

(2.2)   $b(k; n, p) = \frac{n!}{k!\,(n-k)!}\, p^k q^{n-k}.$

The probability p will be kept fixed, but we shall let n → ∞. As the 
number n of trials increases, we must expect that also the numbers of 
successes and failures will increase, and we are therefore interested 
mainly in such combinations of n and k for which 

(2.3)   $n \to \infty, \quad k \to \infty, \quad n-k \to \infty.$

3 Abraham DeMoivre (1667-1754). His The Doctrine of Chances appeared in 
1718. 

4 Pierre S. Laplace (1749-1827). His Théorie analytique des probabilités appeared 
in 1812. 



We may then express the factorials in (2.2) by means of Stirling's 
formula⁵ 

(2.4)   $r! \sim (2\pi)^{1/2}\, r^{r+1/2}\, e^{-r}$

as r → ∞. We get⁶ 

(2.5)   $b(k; n, p) \sim \left\{\frac{n}{2\pi k(n-k)}\right\}^{1/2} \left(\frac{np}{k}\right)^{k} \left(\frac{nq}{n-k}\right)^{n-k}.$

The last two factors on the right equal unity for k = np, and their 
product decreases as |k − np| increases. It is, therefore, natural 
to replace k by the new variable 

(2.6)   $\delta_k = k - np.$

[From Chapter 6, section 3, we know that the number of successes 
for which b(k; n, p) is greatest lies in the interval (np − q, np + p). 
Accordingly, δ_k is approximately the deviation of the number k of 
successes from its most probable value.] With the notation (2.6) we 
have k = np + δ_k and n − k = nq − δ_k, so that (2.5) becomes 

(2.7)   $b(k; n, p) \sim \left\{\frac{n}{2\pi(np+\delta_k)(nq-\delta_k)}\right\}^{1/2} \frac{1}{(1+\delta_k/np)^{np+\delta_k}\,(1-\delta_k/nq)^{nq-\delta_k}}.$

To evaluate the last fraction we pass to logarithms. In the interval 
|δ_k| < npq we may use Taylor’s expansion [Chapter 2, formula (6.9)], 
and find for the logarithm of the denominator 

(2.8)   $(np+\delta_k)\log(1+\delta_k/np) + (nq-\delta_k)\log(1-\delta_k/nq)$

        $= (np+\delta_k)\left\{\frac{\delta_k}{np} - \frac{\delta_k^2}{2n^2p^2} + \frac{\delta_k^3}{3n^3p^3} - + \cdots\right\} - (nq-\delta_k)\left\{\frac{\delta_k}{nq} + \frac{\delta_k^2}{2n^2q^2} + \frac{\delta_k^3}{3n^3q^3} + \cdots\right\}.$


5 It will be recalled that in Chapter 2 we did not complete the proof of Stirling’s 
formula but showed only that r! ~ C r^{r+1/2} e^{−r}, where C is a positive constant. In 
the text it is assumed that C = (2π)^{1/2}. If we want to prove this fact, then the 
factor (2π)^{1/2} in equations (2.5), (2.7), and (2.11) must be replaced by C. In this case 
a factor (2π)^{1/2}/C must be inserted on the right sides in (2.14), (2.17), and (2.20). 
To show that this factor really equals 1 it suffices to choose x_β and −x_α very large. 
The right side in the modified equation (2.20) is then arbitrarily near to (2π)^{1/2}/C, 
while the left side is near 1 (cf. Chapter 6, problem 19). 

6 The symbol ~ indicates that the ratio of the two sides tends to unity. 



Reordering the terms according to powers of δ_k, we get 

(2.9)   $\frac{\delta_k^2}{2n}\left(\frac{1}{p} + \frac{1}{q}\right) - \frac{\delta_k^3}{6n^2}\left(\frac{1}{p^2} - \frac{1}{q^2}\right) + \cdots.$


We now suppose that k increases with n in such a manner that 

(2.10)   $\frac{\delta_k^3}{n^2} \to 0.$

[In this case also δ_k/n → 0 so that (2.3) holds and the expansion (2.8) 
is justified.] It follows from (2.10) that the term within braces in 
(2.7) is asymptotically equivalent to (2πnpq)^{−1}. The logarithm of 
the denominator in (2.7) is given by (2.9), but in view of (2.10) all 
terms except the first one may be neglected; this first term equals 
δ_k²/2npq. Combining these results, we have 

(2.11)   $b(k; n, p) \sim (2\pi npq)^{-1/2}\, e^{-\delta_k^2/(2npq)}.$

This is the desired asymptotic formula. We simplify it by the use 
of a more convenient notation. Put 

(2.12)   $h = \frac{1}{(npq)^{1/2}}$

and define a function x_k of the variable k by 

(2.13)   $x_k = (k - np)h = \frac{k - np}{(npq)^{1/2}}.$

In terms of these quantities and the normal density function φ(x) we 
can rewrite (2.11) in the form 

(2.14)   $b(k; n, p) \sim h\,\phi(x_k).$

This formula has been derived under the sole condition (2.10), which 
can be rewritten as 

(2.15)   $h x_k^3 \to 0.$

We have thus the 

Theorem. If n and k vary in such a way that (k − np)³/n² → 0, 
then the asymptotic formula (2.14) holds. 

The error committed in replacing b(k; n, p) by hφ(x_k) decreases with 
increasing npq. Figure 6 illustrates the theorem in the case n = 10, 
p = 0.2 where npq is only 1.6. It is seen that even in this extremely 
unfavorable case the approximation is surprisingly good.⁷ 

Figure 6. The Normal Approximation to the Binomial Distribution. The step 
function gives the probabilities b(k; 10, 1/5) of k successes in 10 Bernoulli trials 
with p = 1/5. The continuous curve gives for each integer k the corresponding 
normal approximation. 
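
The comparison illustrated by Figure 6 can be reproduced directly from (2.12) to (2.14). The following Python sketch tabulates b(k; 10, 0.2) next to hφ(x_k) for k = 0, ..., 6; the function names are assumptions of this illustration.

```python
import math

def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability b(k; n, p) of (2.2)."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def phi(x: float) -> float:
    """Normal density function (1.1)."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def local_normal(k: int, n: int, p: float) -> float:
    """The approximation h * phi(x_k) of (2.14)."""
    h = 1.0 / math.sqrt(n * p * (1.0 - p))   # (2.12)
    x_k = (k - n * p) * h                    # (2.13)
    return h * phi(x_k)

n, p = 10, 0.2
print(" k   b(k; n, p)   h*phi(x_k)")
for k in range(7):
    print(f"{k:2d}   {binom_pmf(k, n, p):.4f}       {local_normal(k, n, p):.4f}")
```

Because the approximation depends on k only through |k − np|, it is symmetric about k = 2, while the binomial terms themselves are not; this asymmetry is the main source of error for small npq.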

Our theorem leads directly to simple approximations for the sum 
(2.1). If 

(2.16)   $h x_\alpha^3 \to 0 \quad\text{and}\quad h x_\beta^3 \to 0,$

then (2.14) holds uniformly for all terms in (2.1), and therefore 

(2.17)   $\Pr\{\alpha \le S_n \le \beta\} \sim h\{\phi(x_\alpha) + \phi(x_{\alpha+1}) + \cdots + \phi(x_\beta)\}.$

The right side is a Riemann sum approximating an integral. The 
points x_α, x_{α+1}, ..., x_β are uniformly spaced at distance h, and the 
contribution hφ(x_k) is asymptotically equivalent to the area under 
the curve y = φ(x) between x_k − h/2 and x_k + h/2. In other words, 
it is claimed that for all k of the interval α ≤ k ≤ β uniformly 

(2.18)   $h\,\phi(x_k) \sim \int_{x_k-h/2}^{x_k+h/2} \phi(x)\,dx = \Phi(x_{k+1/2}) - \Phi(x_{k-1/2}).$

7 The values of b(k; 10, 0.2) for k = 0, 1, ..., 6 are 0.1074, 0.2684, 0.3020, 0.2013, 
0.0880, 0.0264, 0.0055. The corresponding approximations hφ(x_k) are 0.0904, 
0.2307, 0.3154, 0.2307, 0.0904, 0.0189, 0.0021. 

To verify (2.18) it suffices to note that the definition (1.1) of φ(x) 
implies that in the interval x_k − h/2 < x < x_k + h/2 we have the 
double inequality 

(2.19) *(**)e -1 -** < 4>(x) < <t>(x k )e' Xk 1 i. 

Since x_k h → 0, the assertion (2.18) follows. If we add (2.18) for 
k = α, α + 1, ..., β, then all intermediate terms on the right cancel, 
and only Φ(x_{β+1/2}) − Φ(x_{α−1/2}) remains. We have thus proved the 

DeMoivre-Laplace limit theorem. If α and β vary so that hx_α³ → 0 
and hx_β³ → 0, then 

(2.20)   $\Pr\{\alpha \le S_n \le \beta\} \sim \Phi(x_{\beta+1/2}) - \Phi(x_{\alpha-1/2}),$

where h = (npq)^{−1/2} and x_t = (t − np)h. In words, the percentage 
difference between the two sides in (2.20) tends to zero together with hx_β³ 
and hx_α³. 

In particular, (2.20) holds if α and β are restricted to values for 
which x_α and x_β remain within a fixed interval. [The case where α 
and β are so large that the condition (2.16) is not satisfied will be 
discussed in section 5 and in problem 14.] 

In statistical applications (2.20) is usually used for values α and β for 
which |x_α| and |x_β| do not exceed 3 or 4. In theoretical applications 
it is often necessary to use (2.20) for intervals (α, β) which are far off 
the central part of the binomial distribution and for which both x_α 
and x_β are large. In such cases both sides of (2.20) are small, and it 
becomes important to know that their ratio is near unity so that not 
only their absolute, but also their percentage, difference tends to zero. 

Examples. (a) Let p = 1/2, n = 200, α = 95, β = 105. In this 
case, π = Pr{95 ≤ S_n ≤ 105} may be interpreted as the probability 
that in 200 tossings of a coin the number of heads deviates from 
100 by at most 5. A cumbersome direct computation⁸ shows 
that π = 0.56325.... To find the right side in (2.20) we calculate 
h = (50)^{−1/2} = 0.141421..., and hence −x_{α−1/2} = x_{β+1/2} = (5.5)h 
= 0.7778.... From tables we get Φ(x_{β+1/2}) − Φ(x_{α−1/2}) = 0.56331.... 

8 J. V. Uspensky, Introduction to Mathematical Probability , New York, 1937, p. 
131. Uspensky uses the normal approximation — <t>0r a ), instead of $Cr/9+J4) 
— &(x a -y£) and gets 0.56050 ... with an error 0.0027 ... instead of our 0.00006. 





The error is 0.00006. This is less than should be expected in general. 

(b) Let p = 1/10, n = 500, α = 50, β = 55. The correct value is 
Pr{50 ≤ S_n ≤ 55} = 0.317573.... Now h = (45)^{−1/2} = 0.1490712 
and we get the approximation Φ(5.5h) − Φ(−0.5h) = 0.3235.... The 
error is about 2 per cent. 
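
With a computer the "cumbersome direct computation" is immediate, so both sides of (2.20) can be compared for examples (a) and (b). The Python sketch below is an illustration under the assumption that ordinary floating-point arithmetic is accurate enough for these sums; the function names are not from the text.

```python
import math

def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability b(k; n, p)."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def Phi(x: float) -> float:
    """Normal distribution function (1.2)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def exact(alpha: int, beta: int, n: int, p: float) -> float:
    """Left side of (2.20): the sum (2.1)."""
    return sum(binom_pmf(k, n, p) for k in range(alpha, beta + 1))

def demoivre_laplace(alpha: int, beta: int, n: int, p: float) -> float:
    """Right side of (2.20), including the half-unit shifts."""
    h = 1.0 / math.sqrt(n * p * (1.0 - p))
    return Phi((beta + 0.5 - n * p) * h) - Phi((alpha - 0.5 - n * p) * h)

for alpha, beta, n, p in [(95, 105, 200, 0.5), (50, 55, 500, 0.1)]:
    e = exact(alpha, beta, n, p)
    a = demoivre_laplace(alpha, beta, n, p)
    print(f"n={n}, p={p}: exact={e:.5f}  approx={a:.5f}  "
          f"relative error={100.0 * (a - e) / e:+.2f}%")
```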

(c) Let n = 100, p = 0.3. Table 2 shows in a typical example 
(for relatively small n) how the normal approximation deteriorates as 
the interval (α, β) moves away from the central term. 

TABLE 2 

Comparison of the Binomial Distribution for n = 100, p = 0.3 and the 
Normal Approximation 

  Number of                         Normal          Percentage 
  Successes        Probability   Approximation        Error 
   9 ≤ S_n ≤ 11     0.000 006      0.000 03           +400 
  12 ≤ S_n ≤ 14      .000 15        .000 33           +100 
  15 ≤ S_n ≤ 17      .002 01        .002 83            +40 
  18 ≤ S_n ≤ 20      .014 30        .015 99            +12 
  21 ≤ S_n ≤ 23      .059 07        .058 95              0 
  24 ≤ S_n ≤ 26      .148 87        .144 47             −3 
  27 ≤ S_n ≤ 29      .237 94        .234 05             −2 
  31 ≤ S_n ≤ 33      .230 13        .234 05             +2 
  34 ≤ S_n ≤ 36      .140 86        .144 47             +3 
  37 ≤ S_n ≤ 39      .058 89        .058 95              0 
  40 ≤ S_n ≤ 42      .017 02        .015 99             −6 
  43 ≤ S_n ≤ 45      .003 43        .002 83            −18 
  46 ≤ S_n ≤ 48      .000 49        .000 33            −33 
  49 ≤ S_n ≤ 51      .000 05        .000 03            −40 
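
A table of this kind can be regenerated for any n and p. The sketch below recomputes the rows for n = 100, p = 0.3 from the exact sum (2.1) and from the right side of (2.20); it is meant as an illustration, and its printed percentages are computed from unrounded values, so they may differ slightly from percentages derived from the rounded table entries.

```python
import math

def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability b(k; n, p)."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def Phi(x: float) -> float:
    """Normal distribution function (1.2)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def exact(alpha: int, beta: int, n: int, p: float) -> float:
    """The sum (2.1)."""
    return sum(binom_pmf(k, n, p) for k in range(alpha, beta + 1))

def approx(alpha: int, beta: int, n: int, p: float) -> float:
    """Right side of (2.20)."""
    h = 1.0 / math.sqrt(n * p * (1.0 - p))
    return Phi((beta + 0.5 - n * p) * h) - Phi((alpha - 0.5 - n * p) * h)

n, p = 100, 0.3
# Lower ends of the three-term intervals, in the order used by Table 2.
starts = [9, 12, 15, 18, 21, 24, 27, 31, 34, 37, 40, 43, 46, 49]
for a in starts:
    e, g = exact(a, a + 2, n, p), approx(a, a + 2, n, p)
    print(f"{a:2d} <= S_n <= {a + 2:2d}   {e:.6f}   {g:.6f}   {100.0 * (g - e) / e:+.0f}%")
```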


The limit theorem (2.20) takes on a simpler form if, instead of S_n, 
we introduce the reduced number of successes defined by 

(2.21)   $S_n^* = \frac{S_n - np}{(npq)^{1/2}}.$

This amounts to measuring the deviations of S_n from np in units of 
(npq)^{1/2}. The quantity np is called the mean, and (npq)^{1/2} the standard 
deviation of S_n; this terminology is suggested by the theory of random 
variables (cf. Chapter 9). The inequality α ≤ S_n ≤ β is the same as 
x_α ≤ S_n* ≤ x_β, and (2.18) states that for arbitrary fixed x_α < x_β 

(2.22)   $\Pr\{x_\alpha \le S_n^* \le x_\beta\} \sim \Phi(x_\beta + h/2) - \Phi(x_\alpha - h/2),$

where h = (npq)^{−1/2}. Now h → 0 as n → ∞, and therefore the right side 
tends to Φ(x_β) − Φ(x_α). Thus we have the following 

Corollary to the Limit Theorem. For every fixed a < b 

(2.23)   $\Pr\{a \le S_n^* \le b\} \to \Phi(b) - \Phi(a).$

This is a weakened version of (2.20) but represents the traditional 
form of Laplace’s limit theorem. The dropping of h/2 in (2.22) 
introduces an error which tends to zero as n → ∞ but has a considerable 
influence when npq is of moderate magnitude (as is the case in the 
three examples (a)-(c)). 

The main fact revealed by (2.23) is that for large n the probability 
on the left is practically independent of p. This permits us to compare 
fluctuations in different series of Bernoulli trials simply by referring 
to our standard units. 

Examples. (d) Let us find a number a such that, for large n, the 
inequality |S_n*| > a has a probability near 1/2. For this it is 
necessary that Φ(a) − Φ(−a) = 1/2 or Φ(a) = 3/4. From tables of the 
normal distribution we find that a = 0.6745, and hence the two 
inequalities 

(2.24)   $|S_n - np| < 0.6745\,(npq)^{1/2} \quad\text{and}\quad |S_n - np| > 0.6745\,(npq)^{1/2}$

are about equally probable. In particular, the probability is about 
1/2 that in n tossings of a coin the number of heads lies within the 
limits n/2 ± 0.337n^{1/2}, and, similarly, that in n throws of a die the 
number of aces lies within the interval n/6 ± 0.251n^{1/2}. The 
probability of S_n lying within the limits np ± 2(npq)^{1/2} is about Φ(2) − Φ(−2) 
= 0.9545..., and for np ± 3(npq)^{1/2} the probability is 0.9973.... 
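
The constants of this example can be recovered without tables by solving Φ(a) = 3/4 numerically. The Python sketch below uses plain bisection; the helper name `quantile`, the search bracket, and the tolerance are assumptions of this illustration rather than anything prescribed by the text.

```python
import math

def Phi(x: float) -> float:
    """Normal distribution function (1.2)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def quantile(prob: float, lo: float = -10.0, hi: float = 10.0,
             tol: float = 1e-10) -> float:
    """Root x of Phi(x) = prob, found by bisection (Phi is increasing)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if Phi(mid) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

a = quantile(0.75)
print(f"a = {a:.4f}")
print(f"coin: n/2 +- {a * math.sqrt(0.25):.3f} * sqrt(n)")
print(f"die:  n/6 +- {a * math.sqrt(5.0 / 36.0):.3f} * sqrt(n)")
print(f"two-sigma probability:   {Phi(2.0) - Phi(-2.0):.4f}")
print(f"three-sigma probability: {Phi(3.0) - Phi(-3.0):.4f}")
```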

(e) A Competition Problem. This example illustrates practical 
applications of formula (2.23). Two competing railroads operate one train 
each between Chicago and Los Angeles, which leave and arrive 
simultaneously and have comparable equipment. We suppose that n 
passengers select trains independently and at random so that the 
number of passengers in each train is the outcome of n Bernoulli trials 
with p = 1/2. If a train carries s < n seats, then there is a positive 
probability f(s) that more than s passengers will turn up, in which 
case not all patrons can be accommodated. Using the approximation 
(2.23), we find 

(2.25)   $f(s) \approx 1 - \Phi\!\left(\frac{2s-n}{n^{1/2}}\right).$

If s is so large that f(s) < 0.01, then the number of seats will be 
sufficient in 99 out of 100 cases. More generally, the company may decide 
on an arbitrary risk level α and determine s so that f(s) < α. For that 
purpose it suffices to put 

(2.26)   $s \ge \tfrac{1}{2}\,(n + t_\alpha n^{1/2}),$

where t_α is the root of the equation α = 1 − Φ(t_α), which can be found 
from tables. For example, if n = 1000 and α = 0.01, then t_α ≈ 2.33 
and s = 537 seats should suffice. If both railroads accept the risk 
level α = 0.01, the two trains will carry a total of 1074 seats of which 
74 will be empty. The loss due to competition (or chance fluctuations) 
is remarkably small. In the same way, 514 seats should suffice in 
about 80 per cent of all cases, and 549 seats in 999 out of 1000 cases. 
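
The seat counts quoted for the two railroads follow from (2.26) once t_α is known. The following Python sketch computes t_α by bisection and rounds the bound up to a whole seat; `quantile` and `seats_needed` are hypothetical helper names introduced only for this illustration.

```python
import math

def Phi(x: float) -> float:
    """Normal distribution function (1.2)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def quantile(prob: float, lo: float = -10.0, hi: float = 10.0,
             tol: float = 1e-10) -> float:
    """Root x of Phi(x) = prob, found by bisection."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if Phi(mid) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def seats_needed(n: int, alpha: float) -> int:
    """Smallest integer s satisfying (2.26) for risk level alpha."""
    t_alpha = quantile(1.0 - alpha)          # root of alpha = 1 - Phi(t)
    return math.ceil(0.5 * (n + t_alpha * math.sqrt(n)))

for alpha in (0.2, 0.01, 0.001):
    s = seats_needed(1000, alpha)
    print(f"alpha={alpha}: s={s}, empty seats on two trains={2 * s - 1000}")
```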

Similar considerations apply in other competitive supply problems. 
For example, if m movies compete for the same n patrons, each movie 
will put for its probability of success p = 1/m, and (2.26) is to be 
replaced by s ≥ (1/m)[n + t_α n^{1/2}(m − 1)^{1/2}]. The total number of 
empty seats under this system is ms − n ≈ t_α n^{1/2}(m − 1)^{1/2}. For 
α = 0.01, n = 1000, and m = 2, 3, 4, 5 this number is about 74, 104, 127, and 
147, respectively. The loss of efficiency because of competition is 
remarkably small. 

Theorem (2.23) is historically the first limit theorem of probability. From a 
modern point of view it is only an exceedingly special case of the central limit 
theorem, to which we shall return in Chapter 10 but whose general derivation must 
be postponed to the second volume. Statisticians use (2.23) as an approximation 
even where npq is relatively small, and in such cases an estimate of the error is 
desired. In Uspensky's book (cited in the last footnote) the reader will find 
an excellent bound for the error in the case of intervals (a, b) near the origin. The 
derivation requires Fourier analysis. Serge Bernstein devoted a series of papers 
to the investigation of the error term in the general case and discussed how the 
definition of x_t should be modified in order to improve the convergence in (2.20). 
His papers are written in Russian and are difficult to obtain. A simplified derivation 
with an improvement of his results is, however, available in English.⁹ 

9 W. Feller, On the Normal Approximation to the Binomial Distribution, Annals 
of Mathematical Statistics, vol. 16 (1945), pp. 319-329. 

Note on Optional Stopping. It is essential to note that our limit 
and approximation theorems are valid only if the number n of trials is 
fixed in advance independently of the outcome of the trials. If a gambler 
has the privilege of stopping at a moment favorable to him, his 
ultimate gain cannot be judged from the normal approximation, for now 
the duration of the game depends on chance. For every fixed n it is 
very improbable that S_n* is large. However, in the long run, even the 
most improbable thing is bound to happen, and we shall see that in a 
continued game S_n* is practically certain to have a sequence of maxima 
of the order of magnitude (log log n)^{1/2} (this is the law of the iterated 
logarithm of Chapter 8, section 5). 


3. The Law of Large Numbers 

On several occasions we have mentioned that our intuitive notion of 
probability is based on the following assumption. If in n identical 
trials A occurs ν times, and if n is very large, then ν/n should be near 
the probability p of A. Clearly, a formal mathematical theory can 
never refer directly to real life, but it should at least provide theoretical 
counterparts to the phenomena which it tries to explain. Accordingly, 
we require that the vague introductory remark be made precise in 
the form of a theorem. For this purpose we translate “identical trials” 
as “Bernoulli trials” with probability p for success. Then, so we 
expect, S_n/n should be near p. The Laplace limit theorem gives 
precise meaning to this statement. For every ε > 0 the event 
|S_n/n − p| < ε is the same as |S_n − np| < εn or |S_n*| < ε(n/pq)^{1/2}. 
Hence, for large n, 

(3.1)   $\Pr\left\{\left|\frac{S_n}{n} - p\right| < \epsilon\right\} \approx \Phi\!\left(\epsilon\,(n/pq)^{1/2}\right) - \Phi\!\left(-\epsilon\,(n/pq)^{1/2}\right),$

and the right side is near unity. Thus we have one form of the law of 
large numbers which asserts that 

(3.2)   $\Pr\left\{\left|\frac{S_n}{n} - p\right| < \epsilon\right\} \to 1.$

In words: as n increases, the probability that the average number of 
successes deviates from p by more than any preassigned ε tends to zero. 
Its practical value depends entirely on the more precise form (3.1). 
In fact, even the use of (3.1) depends on the assurance that it is not 
only a limit relation, but that even for a given n the right side is a 
reasonable approximation to the left side. (Cf. the remarks concerning 
the error terms at the conclusion of the preceding section.) 

Examples. (a) Random Digits. According to Chapter 2, Table 1, 
the digit 7 appeared among 10,000 digits 968 times. In an ideal 
sequence of 10,000 Bernoulli trials with p = 0.1 the standard deviation 
is 30, and the probability of a deviation |S_n − n/10| > 32 is about 





-*( a ) + *(- a )-*{ 1 “• 


can 


be interpreted roughly by saying that in four out of ten ideal cases the 
deviation should be larger than the one observed. 

(b) Random Digits, Continued. In example (2.b) of Chapter 2, we 
considered an event with p = 0.3024. In n = 1200 trials this event had 
an average frequency of 0.3142. The deviation from p is ε = 0.0118. 
In this case (pq)^{1/2} = 0.4593 and ε(n/pq)^{1/2} ≈ 0.890.... Hence the 
probability of |S_n/n − p| > ε is in this case about 0.37.... This 
indicates that in about 37 per cent of all cases the average number of 
successes should deviate from p by more than it does in our material. 
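
The numbers in this example come straight from the right side of (3.1). A minimal Python sketch of that calculation, with `lln_probability` as a hypothetical helper name, is:

```python
import math

def Phi(x: float) -> float:
    """Normal distribution function (1.2)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def lln_probability(eps: float, n: int, p: float) -> float:
    """Right side of (3.1): approximate Pr{|S_n/n - p| < eps}."""
    z = eps * math.sqrt(n / (p * (1.0 - p)))
    return Phi(z) - Phi(-z)

p, n, eps = 0.3024, 1200, 0.0118
z = eps * math.sqrt(n / (p * (1.0 - p)))
print(f"(pq)^(1/2)     = {math.sqrt(p * (1.0 - p)):.4f}")
print(f"eps (n/pq)^1/2 = {z:.3f}")
print(f"Pr(|S_n/n - p| > eps) is about {1.0 - lln_probability(eps, n, p):.3f}")

# The same function shows the law of large numbers at work for fixed eps.
for trials in (1200, 12000, 120000):
    print(trials, f"{lln_probability(eps, trials, p):.4f}")
```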

(c) Sampling. A fraction p of a certain population are smokers. 
Suppose that p is unknown and that random sampling with 
replacement is to be used to determine p. It is desired to find p with an error 
not exceeding 0.005. How large should the sample size n be? If p' is 
the fraction of smokers in the sample, we desire that |p' − p| < 0.005. 
However, no sample size can give absolute assurance that |p' − p| 
< 0.005; it is conceivable that the sample contains only smokers. 
Since absolute certainty is unattainable, we settle for an arbitrary 
confidence level α, say, α = 0.95, and require that |p' − p| < 0.005 
with probability 0.95 or better. Note that np' is the number of 
successes in n trials, and hence 

$\Pr\{|p' - p| < 0.005\} = \Pr\left\{\left|\frac{S_n}{n} - p\right| < 0.005\right\}.$

This means that n should be selected large enough to make the left 
side of (3.1), with ε = 0.005, greater than 0.95. For the present 
purposes the normal approximation is sufficient. The root x of 
Φ(x) − Φ(−x) = 0.95 is x = 1.96..., and hence we should have 
0.005(n/pq)^{1/2} ≥ 1.96. Thus we are led to the inequality n ≥ 392²pq 
or n ≥ 160,000pq, approximately. This inequality depends on the 
unknown p, but pq never exceeds 1/4, and hence the sample size 
n = 40,000 would be safe under all circumstances; with it the odds 
are about 20 to 1 that |p' − p| < 0.005. If it is known in advance 
that p < 1/10, then pq < 9/100, and a sample size of 15,000 should 
suffice, etc. 
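
The sample-size argument can be packaged as a small function. The Python sketch below follows the same steps (find the root x, then require ε(n/pq)^{1/2} ≥ x with pq replaced by an upper bound); because it does not round 392² up to 160,000, it returns somewhat smaller sizes than the round figures quoted above. The names `quantile` and `sample_size` are assumptions of this illustration.

```python
import math

def Phi(x: float) -> float:
    """Normal distribution function (1.2)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def quantile(prob: float, lo: float = -10.0, hi: float = 10.0,
             tol: float = 1e-10) -> float:
    """Root x of Phi(x) = prob, found by bisection."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if Phi(mid) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def sample_size(err: float, confidence: float, pq_bound: float = 0.25) -> int:
    """Smallest n with err * (n / pq)^(1/2) >= x, where Phi(x) - Phi(-x) = confidence."""
    x = quantile(0.5 * (1.0 + confidence))
    return math.ceil((x / err) ** 2 * pq_bound)

print(sample_size(0.005, 0.95))          # pq bounded by 1/4
print(sample_size(0.005, 0.95, 0.09))    # p < 1/10, so pq < 9/100
```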


The theorem (3.2) is again only a special case of the general (weak) 
law of large numbers (cf. Chapter 10). A stronger and much more 
interesting theorem is the strong law of large numbers , to be proved in 
Chapter 8, section 4. 





4. Relation to the Poisson Approximation 

The error of the normal approximation will be small if npq is large. 
On the other hand, if n is large and p small, the terms b(k; n, p) were 
found to be near the Poisson probabilities p(k; λ) with λ = np. If λ 
is small, then only the Poisson approximation can be used. However, 
if λ is large, we can use either the normal or the Poisson approximation. 
This implies that for large values of λ it must be possible to 
approximate the Poisson distribution by the normal distribution, and in 
Chapter 10, example (1.c) we shall see that this is indeed so (cf. also problem 
9). Here we shall be content to illustrate the point by a numerical 
and a practical example. 


Examples. (a) Consider the Poisson distribution p(k; 100). We 
can take it as an approximation, say, to the binomial distribution with 
n = 100,000,000 and p = 1/1,000,000. Then also npq ≈ 100; this 
quantity, even though not large, suffices for the normal distribution to 
give reasonable approximations at least for the central sector of 
the binomial distribution. The Poisson distribution p(k; 100) agrees 
with b(k; 10⁸, 10⁻⁶) to many decimals, and we can compare it with 
the normal approximation to the latter. Put, for brevity, P(a, b) 
= p(a; 100) + p(a + 1; 100) + ... + p(b; 100), so that P(a, b) stands 
for Pr{a ≤ S_n ≤ b} and should be approximated by 
Φ((b − 99.5)/10) − Φ((a − 100.5)/10). The following sample gives an idea 
of the degree of approximation. 

                 Correct Values   Normal Approximation 
  P(85, 90)         0.113 84           0.110 49 
  P(90, 95)          .184 85            .179 50 
  P(95, 105)         .417 63            .417 68 
  P(90, 110)         .706 02            .706 28 
  P(110, 115)        .107 38            .110 49 
  P(115, 120)        .053 23            .053 35 
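
The two columns above can be recomputed in a few lines. The following Python sketch sums Poisson probabilities with λ = 100 and evaluates the normal approximation with the half-unit shifts of (2.20), taking np = 100 and (npq)^{1/2} = 10 as in the example; the function names are assumptions of this illustration.

```python
import math

def Phi(x: float) -> float:
    """Normal distribution function (1.2)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability p(k; lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def P(a: int, b: int, lam: float = 100.0) -> float:
    """Sum p(a; lam) + ... + p(b; lam)."""
    return sum(poisson_pmf(k, lam) for k in range(a, b + 1))

def normal_approx(a: int, b: int, lam: float = 100.0) -> float:
    """Right side of (2.20) with mean lam and standard deviation lam^(1/2)."""
    h = 1.0 / math.sqrt(lam)
    return Phi((b + 0.5 - lam) * h) - Phi((a - 0.5 - lam) * h)

for a, b in [(85, 90), (90, 95), (95, 105), (90, 110), (110, 115), (115, 120)]:
    print(f"P({a}, {b}):  {P(a, b):.5f}   {normal_approx(a, b):.5f}")
```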


( b ) A Telephone Trunking Problem. The following problem is, 
with some simplifications, taken from actual practice. 10 A telephone 
exchange A is to serve 2000 subscribers in a nearby exchange B. It 
would be too expensive and extravagant to install 2000 trunklines from 
A to B . It will suffice to make the number N of lines so large that, 

10 E. C. Molina, Probability in Engineering, Electrical Engineering, vol. 54 (1935), 
pp. 423-427, or Bell Telephone System Technical Publications Monograph B-854. 
There the problem is treated by the Poisson method given in the text, which is 
preferable from the engineer’s viewpoint. 





under ordinary conditions, only one out of every hundred calls will
fail to find an idle trunkline immediately at its disposal. Suppose that
during the busy hour of the day on the average each subscriber requires
a trunkline to B for 2 minutes. At a fixed moment of the busy
hour we can reasonably compare the situation to a set of 2000 trials
with a probability $p = 1/30$ in each that a line will be required. Under
ordinary conditions these trials can be assumed to be independent
(although this is not true when events like unexpected showers or
earthquakes cause many people to call for taxicabs or the local newspaper;
the theory no longer applies, and the trunks will be “jammed”).
We have, then, 2000 Bernoulli trials with $p = 1/30$, and the smallest
number $N$ is required such that the probability of more than $N$ “successes”
will be smaller than 0.01; in symbols $\Pr\{S_{2000} > N\} < 0.01$.

For the Poisson approximation we should take $\lambda = 200/3 \approx 66.67$.
From the tables we find that the probability of 87 or more successes
is about 0.0097, whereas the probability of 86 or more successes is
about 0.013. This would indicate that 87 trunklines should suffice.
For the normal approximation we first find from tables the root $x$
of $1 - \Phi(x) = 0.01$, which is $x = 2.327$. Then it is required that
$(N - \tfrac{1}{2} - np)/(npq)^{1/2} \ge 2.327$. Since $n = 2000$, $p = 1/30$ this means
$N \ge 67.17 + (2.327)(8.027) \approx 85.8$. Hence the normal approximation
would indicate that 86 trunklines should suffice.

For practical purposes the two solutions agree. The method yields 
further practical results. Conceivably, the installation might be 
cheaper if the 2000 subscribers were divided into two groups of 1000 
each, and two separate groups of trunklines from A to B were installed. 
Using the above method, we find that actually some 10 additional 
trunklines would be required so that the first arrangement is more 
favorable. 
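
The two calculations can be retraced numerically. The sketch below is an illustration under stated assumptions, not the procedure of the text: the function names are hypothetical, the Poisson tail is summed directly instead of being read from tables, and the criterion is the one used above (the probability of $N$ or more successes must fall below 0.01).

```python
import math
from statistics import NormalDist

def poisson_tail(N, lam):
    """Pr{S >= N} for a Poisson variable with mean lam."""
    below = sum(math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))
                for k in range(N))
    return 1.0 - below

def trunks_poisson(n, p, risk=0.01):
    """Smallest N whose Poisson tail Pr{S >= N} is below `risk`."""
    lam = n * p
    N = 0
    while poisson_tail(N, lam) >= risk:
        N += 1
    return N

def trunks_normal(n, p, risk=0.01):
    """Smallest integer N with (N - 1/2 - np) / sqrt(npq) >= x, where 1 - Phi(x) = risk."""
    x = NormalDist().inv_cdf(1.0 - risk)
    sigma = math.sqrt(n * p * (1.0 - p))
    return math.ceil(n * p + 0.5 + x * sigma)

if __name__ == "__main__":
    print(trunks_poisson(2000, 1 / 30), trunks_normal(2000, 1 / 30))
    # Two separate groups of 1000 subscribers, each with its own trunk group.
    print(2 * trunks_normal(1000, 1 / 30))
```

Under the normal approximation the split arrangement needs 48 lines per group, which is where the figure of some 10 additional trunklines comes from.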

5. Large Deviations¹¹

Frequently we desire an estimate of the probability that the reduced
number of successes $S_n^*$ [cf. (2.21)] exceeds a given number $x$. Hence
the upper limit of the interval is infinity, and it requires a special
argument to show that our limit theorem (2.20) still applies.

Theorem. If $n \to \infty$ and $x$ varies as a function of $n$ in such a way
that $x \to \infty$ but $x^3 h \to 0$, then

(5.1)    $\Pr\{S_n^* > x\} \sim 1 - \Phi(x).$

¹¹ The theorem is of general interest but will be used in this book only for the
proof of the law of the iterated logarithm, Chapter 8.



In view of (1.7) this is equivalent to

(5.2)    $\Pr\{S_n^* > x\} \sim \frac{1}{(2\pi)^{1/2}\, x}\, e^{-x^2/2}.$


Proof. Choose in (2.20) the integers $\alpha$ and $\beta$ so that $x$ lies between
$x_\alpha$ and $x_{\alpha+1}$, and that $x_\beta \approx x + \log x$. Then $x_\beta^3 h \to 0$ and (2.20)
holds. Hence

(5.3)    $\Pr\{\alpha \le S_n \le \beta\} \sim \{1 - \Phi(x_\alpha)\} - \{1 - \Phi(x_\beta)\}.$

However, from (1.7) and the fact that $x_\beta \approx x_\alpha + \log x_\alpha$ it is readily
seen that $1 - \Phi(x_\beta)$ is of smaller order of magnitude than $1 - \Phi(x_\alpha)$,
while $1 - \Phi(x_\alpha) \sim 1 - \Phi(x)$. Hence

(5.4)    $\Pr\{\alpha \le S_n \le \beta\} \sim 1 - \Phi(x).$

On the other hand, from (2.14) and formula (8.1) of Chapter 6 we have

(5.5)    $\Pr\{S_n > \beta\} < \frac{n}{\beta - np}\, b(\beta; n, p) \sim \frac{nh^2}{x_\beta}\, \phi(x_\beta).$

Now $nh^2 = 1/pq$ is a constant, and

(5.6)    $\frac{1}{x_\beta}\, \phi(x_\beta) \sim 1 - \Phi(x_\beta).$

We saw that the right side tends to zero faster than $1 - \Phi(x)$, which
means that $\Pr\{S_n > \beta\}$ is of smaller order of magnitude than $1 - \Phi(x)$.
Combining this result with (5.4), we see then that

(5.7)    $\Pr\{S_n \ge \alpha\} \sim 1 - \Phi(x),$

and this is our theorem.
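
A numerical check of (5.1) may help to fix ideas. The following sketch (illustrative names, and the symmetric case $p = 1/2$ chosen only for convenience) sums the exact binomial tail and divides it by $1 - \Phi(x)$; for fixed $x$ the ratio should move toward 1 as $n$ grows, since $x^3 h$ then decreases.

```python
import math

def log_binomial_pmf(k, n, p):
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))

def exact_tail(n, p, x):
    """Exact Pr{S_n* > x} for n Bernoulli trials with success probability p."""
    sigma = math.sqrt(n * p * (1.0 - p))
    first = math.floor(n * p + x * sigma) + 1  # smallest k with (k - np)/sigma > x
    return sum(math.exp(log_binomial_pmf(k, n, p)) for k in range(first, n + 1))

def normal_tail(x):
    """1 - Phi(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def tail_ratio(n, p, x):
    return exact_tail(n, p, x) / normal_tail(x)

if __name__ == "__main__":
    for n in (100, 10_000, 100_000):
        print(n, round(tail_ratio(n, 0.5, 2.0), 4))
```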

Further limit theorems for large deviations are given in problems 
12-17. 


6. Problems for Solution 

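1. Generalizing (1.8), prove that

(6.1)    $1 - \Phi(x) \sim \frac{1}{(2\pi)^{1/2}}\, e^{-x^2/2} \left\{ \frac{1}{x} - \frac{1}{x^3} + \frac{1 \cdot 3}{x^5} - \frac{1 \cdot 3 \cdot 5}{x^7} + - \cdots + (-1)^k \frac{1 \cdot 3 \cdots (2k-1)}{x^{2k+1}} \right\}$

and that for $x > 0$ the right side overestimates $1 - \Phi(x)$ if $k$ is even, and
underestimates if $k$ is odd.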





2. For every constant $a > 0$

(6.2)    $\left\{1 - \Phi\left(x + \frac{a}{x}\right)\right\} \div \{1 - \Phi(x)\} \to e^{-a}$

as $x \to \infty$.

3. Prove the inequality 

(6.3) he- h,/8 e~ xl/2 < f* + * / V*V a ctt < + **/so)/8 e -**/a 

Jx-V2 if ft <4.2^ 

[This improves (2.19).] 

4. Find an approximation to the probability that the number of aces obtained 
in 12,000 rollings of a die is between 1900 and 2150. 

5. Find a number k such that the probability is about 0.5 that the number of 
heads obtained in 1000 tossings of a coin will be between 440 and k. 

6. A sample is taken in order to find the fraction $f$ of females in a population.
Find a sample size such that the probability of a sampling error less than 0.005 
will be 0.99 or greater. 

7. In 10,000 tossings, a coin fell heads 5400 times. Is it reasonable to assume 
that the coin is skew? 


8. Find an approximation to the maximal term of the trinomial distribution

$\frac{n!}{k!\, r!\, (n - k - r)!}\, p_1^k\, p_2^r\, (1 - p_1 - p_2)^{n-k-r}.$

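9. Normal approximation to the Poisson distribution. Using Stirling's formula,
show that, if $\lambda \to \infty$, then for every fixed $\alpha < \beta$

(6.4)    $\sum_{\lambda + \alpha\lambda^{1/2} < k < \lambda + \beta\lambda^{1/2}} p(k; \lambda) \to \Phi(\beta) - \Phi(\alpha).$

[Cf. Chapter 10, example (1.c).]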

10. Normal approximation to the hypergeometric distribution. Let n, m, k be
positive integers and suppose that

(6.5)    $\frac{r}{n+m} \to t, \qquad \frac{n}{n+m} \to p, \qquad \frac{m}{n+m} \to q, \qquad h(k - rp) \to x,$

where $1/h = \{(n+m)pqt(1-t)\}^{1/2}$. Prove that

(6.6)    $\frac{\binom{n}{k}\binom{m}{r-k}}{\binom{n+m}{r}} \sim h\phi(x).$

Hint: Use the normal approximation to the binomial distribution rather than
Stirling's formula.

11. Normal distribution and combinatorial runs.¹² In Chapter 3, problem 15,
we found that in an arrangement of n red and m black things the probability of
having exactly k runs of red things is

(6.7)    $\binom{n-1}{k-1}\binom{m+1}{k} \div \binom{n+m}{n}.$

Let $n \to \infty$, $m \to \infty$ so that (6.5) holds. For fixed $\alpha < \beta$ the probability that
the number of red runs lies between $npq + \alpha(pqn)^{1/2}$ and $npq + \beta(pqn)^{1/2}$ tends to
$\Phi(\beta) - \Phi(\alpha)$.

¹² A. Wald and J. Wolfowitz, On a Test Whether Two Samples Are from the
Same Population, Annals of Mathematical Statistics, vol. 11 (1940), pp. 147-162.
For more general results, see A. M. Mood, The Distribution Theory of Runs, ibid.,
pp. 367-392.

In the following problems $h^{-2} = npq$ and $S_n^*$ is the reduced number of successes
defined in (2.21). Finally

(6.8)    $F_n(x) = \Pr\{S_n^* > x\}.$

12. If $x$ varies as a function of $n$ so that $x^{3+a} h \to 0$ but $x \to \infty$, then¹³

(6.9)    $\frac{F_n(x)}{1 - \Phi(x)} = 1 + o(x^{-a}),$

where $o(x^{-a})$ stands for terms which are of smaller order of magnitude than $x^{-a}$.

13. If $x^4 h \to 0$, $x \to \infty$, then¹⁴ for any constant $a > 0$

(6.10)    $\frac{F_n(x) - F_n(x + a/x)}{F_n(x)} \to 1 - e^{-a}.$

In words, the conditional probability of $x < S_n^* < x + a/x$, given that $S_n^* > x$,
tends to $1 - e^{-a}$. [Hint: Use (5.2).]

14. Probabilities of large deviations. Starting with (2.5), prove the following
theorem. If $n \to \infty$, and $k$ varies so that $(k - np)/n \to 0$, then

(6.11)    $b(k; n, p) \sim \frac{h}{(2\pi)^{1/2}}\, e^{-x^2/2 - f(x)}$

where $x = (k - np)h$ and

(6.12)    $f(x) = \sum_{\nu=3}^{\infty} \frac{p^{\nu-1} - (-q)^{\nu-1}}{\nu(\nu-1)}\, h^{\nu-2} x^{\nu}.$

Note: If $x^3 h \to 0$, then $f(x) \to 0$, and (6.11) reduces to (2.14). If $x$ is of the
order of magnitude of $h^{-1/3}$ but negligible as compared to $h^{-1/2}$, then

(6.13)    $f(x) \approx \frac{p - q}{6}\, x^3 h.$

If $x$ is of the order of magnitude of $h^{-1/2}$, then

(6.14)    $f(x) \sim \frac{p - q}{6}\, x^3 h + \frac{p^3 + q^3}{12}\, x^4 h^2,$

etc.

¹³ N. Smirnov, Über Wahrscheinlichkeiten grosser Abweichungen (in Russian,
German summary), Recueil Mathématique [Sbornik] Moscou, vol. 40 (1933), pp.
443-454.

¹⁴ A. Khintchine, Über einen neuen Grenzwertsatz der Wahrscheinlichkeitsrechnung,
Mathematische Annalen, vol. 101 (1929), pp. 745-752. Cf. also problem 16.





15. Continuation. Prove that if $x \to \infty$, $xh \to 0$,

(6.15)    $f\left(x + \frac{a}{x}\right) - f(x) \to 0$

and hence

(6.16)    $F_n(x) \sim e^{-f(x)}\{1 - \Phi(x)\}.$

16. Deduce (6.10) from (6.16), assuming only $xh \to 0$.

17. If $p > q$, then for large $x$

$\Pr\{S_n^* > x\} < \Pr\{S_n^* < -x\}.$

(Hint: Use problem 14.)



* CHAPTER 8 


UNLIMITED SEQUENCES OF BERNOULLI TRIALS 

This chapter discusses certain properties of randomness and the 
important law of the iterated logarithm for Bernoulli trials. The 
theory is of general interest, but the material covered in subsequent 
chapters is not connected with it. A different type of limit theorem 
for Bernoulli trials will be discussed in Chapter 12, section 5. 

1. Infinite Sequences of Trials 

In the preceding chapter we have dealt with probabilities connected
with $n$ Bernoulli trials and have studied their asymptotic behavior as
$n \to \infty$. We turn now to a more general type of problem where the
events themselves cannot be defined in a finite sample space.

Example. A Problem in Runs. Let $\alpha$ and $\beta$ be two positive integers,
and consider a potentially unlimited sequence of Bernoulli trials, such
as tossing a coin or throwing dice. Suppose that Paul bets Peter that
a run of $\alpha$ consecutive successes will occur before a run of $\beta$ consecutive
failures. It has an obvious intuitive meaning to speak of the probability
of the event that Paul wins, but it must be remembered that
in the mathematical theory the term event stands for “aggregate of
sample points” and is meaningless unless an appropriate sample space
has been defined. The model of a finite number, $n$, of Bernoulli trials
is insufficient for our present purpose, but the difficulty is solved by
a simple passage to the limit. In $n$ trials Peter wins or loses, or the
game remains undecided. Let the corresponding probabilities be
$x_n$, $y_n$, $z_n$ ($x_n + y_n + z_n = 1$). As the number $n$ of trials increases,
the probability $z_n$ of a tie can only decrease, while both $x_n$ and $y_n$
necessarily increase. Hence $x = \lim x_n$, $y = \lim y_n$, and $z = \lim z_n$
exist. Nobody would hesitate to call them the probabilities of Peter’s
ultimate gain or loss or of a tie. However, the corresponding three
events are defined only in the sample space of infinite sequences of
trials, and this space is not discrete.

The example was introduced for illustration only, and the numerical values of
$x_n$, $y_n$, $z_n$ are not our immediate concern. We shall return to their calculation in
Chapter 13, example (2.5). The limits $x$, $y$, $z$ may be obtained by a simpler method
which is applicable to more general cases. We indicate it here because of its
importance and intrinsic interest.

* Starred chapters treat special topics and may be omitted at first reading.

Let $A$ be the event that a run of $\alpha$ consecutive successes occurs before a run of $\beta$
consecutive failures. Then $A$ means Peter’s winning and $x = \Pr\{A\}$. If $u$ and
$v$ are the conditional probabilities of $A$ under the hypotheses, respectively, that the
first trial results in success or failure, then $x = pu + qv$ [Chapter 5, (1.9)]. Suppose
first that the first trial results in success. In this case the event $A$ can occur in
$\alpha$ mutually exclusive ways: (1) The following $\alpha - 1$ trials result in successes;
the probability for this is $p^{\alpha-1}$. (2) The first failure occurs at the $\nu$th trial where
$2 \le \nu \le \alpha$. Let this event be $H_\nu$. Then $\Pr\{H_\nu\} = p^{\nu-2} q$, and $\Pr\{A \mid H_\nu\} = v$.
Hence (using once more the formula for compound probabilities)

(1.1)    $u = p^{\alpha-1} + qv(1 + p + \cdots + p^{\alpha-2}) = p^{\alpha-1} + v(1 - p^{\alpha-1}).$

In the case of the first trial resulting in failure a similar argument leads to

(1.2)    $v = pu(1 + q + \cdots + q^{\beta-2}) = u(1 - q^{\beta-1}).$

We have thus two simple linear equations for the two unknowns $u$ and $v$, and find
for $x = pu + qv$

(1.3)    $x = p^{\alpha-1}\, \frac{1 - q^{\beta}}{p^{\alpha-1} + q^{\beta-1} - p^{\alpha-1} q^{\beta-1}}.$

To obtain $y$ we have only to interchange $p$ and $q$, and $\alpha$ and $\beta$. Thus

(1.4)    $y = q^{\beta-1}\, \frac{1 - p^{\alpha}}{p^{\alpha-1} + q^{\beta-1} - p^{\alpha-1} q^{\beta-1}}.$

Since $x + y = 1$, we have $z = 0$: the probability of a tie is zero.

For example, in tossing a coin ($p = 1/2$) the probability that a run of two heads
appears before a run of three tails is 0.7; for two consecutive heads before four
consecutive tails the probability is 5/6, for three consecutive heads before four
consecutive tails 15/22. In rolling dice there is probability 0.1753 that two consecutive
aces will appear before five consecutive non-aces, etc.
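
The numerical values just quoted can be checked by solving the two linear equations (1.1) and (1.2) in exact rational arithmetic and comparing with the closed form (1.3). The function names below are illustrative only.

```python
from fractions import Fraction

def run_probability_linear(p, alpha, beta):
    """x = pu + qv, with u and v solved from equations (1.1) and (1.2)."""
    p = Fraction(p)
    q = 1 - p
    a = p ** (alpha - 1)        # (1.1): u = a + (1 - a) v
    c = 1 - q ** (beta - 1)     # (1.2): v = c u
    u = a / (1 - (1 - a) * c)
    v = c * u
    return p * u + q * v

def run_probability_closed(p, alpha, beta):
    """Closed form (1.3)."""
    p = Fraction(p)
    q = 1 - p
    pa, qb = p ** (alpha - 1), q ** (beta - 1)
    return pa * (1 - q ** beta) / (pa + qb - pa * qb)

if __name__ == "__main__":
    half, sixth = Fraction(1, 2), Fraction(1, 6)
    for p, alpha, beta in [(half, 2, 3), (half, 2, 4), (half, 3, 4), (sixth, 2, 5)]:
        x = run_probability_linear(p, alpha, beta)
        print(p, alpha, beta, x, float(x))
```

Interchanging $p$ with $q$ and $\alpha$ with $\beta$ in the same function gives $y$, and the two values add to one exactly.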


In the present volume we are confined to the theory of discrete
sample spaces, and this means a considerable loss of mathematical
elegance. The general theory considers $n$ Bernoulli trials only as the
beginning of an infinite sequence of trials. A sample point is then
represented by an infinite sequence of letters S and F, and the sample
space is the aggregate of all such sequences. A finite sequence, like
SSFS, stands for the aggregate of all points with this beginning, that is,
for the compound event that in an infinite sequence of trials the first
four result in S, S, F, S, respectively. In the infinite sample space the
game of our example can be interpreted without a limiting process.
Take any point, that is, a sequence SSFSFF .... In it a run of $\alpha$
consecutive S’s may or may not occur. If it does, it may or may not
be preceded by a run of $\beta$ consecutive F’s. In this way we get a
classification of all sample points into three classes, representing the events
“Peter wins,” “Peter loses,” “no decision.” Their probabilities are
the numbers $x$, $y$, $z$, computed above. The only trouble with this
sample space is that it is not discrete, and we have not yet defined
probabilities in general sample spaces.

Note that we are discussing a question of terminology rather than a 
genuine difficulty. In our example there was no question about the 
proper definition or interpretation of the number x. The trouble is 
only that for consistency we must either decide to refer to the number x 
as “the limit of the probability x n that Peter wins in n trials” or else 
talk of the event “that Peter wins,” which means referring to a non-discrete
sample space. We propose to do both. For simplicity of
language we shall refer to events even when they are defined in the
infinite sample space; for precision, the theorems will also be formulated
in terms of finite sample spaces and passages to the limit. The
events to be studied in this chapter share the following salient feature 
of our example. The event “Peter wins,” although defined in an 
infinite space, is the union of the events “Peter wins at the nth trial” 
(n = 1, 2, ...), each of which depends only on a finite number of 
trials. The required probability x is the limit of a monotonic sequence 
of probabilities x n which depend only on finitely many trials. We 
require no theory going beyond the model of n Bernoulli trials; we 
merely take the liberty of simplifying clumsy expressions¹ by calling
certain numbers probabilities instead of using the term “limits of 
probabilities.” 

2. Systems of Gambling 

The painful experience of many gamblers has taught us the lesson 
that no system of betting is successful in improving the gambler’s 
chances. If the theory of probability is true to life, this experience 
must correspond to a provable statement. 

For orientation let us consider a potentially unlimited sequence of
Bernoulli trials and suppose that at each trial the bettor has the free
choice of whether or not to bet. A “system” consists in fixed rules
selecting those trials on which the player is to bet. For example, the
bettor may make up his mind to bet at every seventh trial or to wait
as long as necessary for seven heads to occur between two bets. He
may bet only after head runs of length 13, or bet for the first time after
the first head, for the second time after the first run of two consecutive
heads, and generally, for the $k$th time, just after $k$ heads have appeared
in succession. In the latter case he would bet less and less frequently.
Another possible system consists in betting whenever the accumulated
number of heads exceeds the accumulated number of tails (in which
case the average player would bet on tails). We need not consider the
stakes at the individual trials; we want to show that no “system”
changes the bettor’s situation and that he can achieve the same result
by betting every time. It goes without saying that this statement can
be proved only for systems in the ordinary meaning where the bettor
does not know the future (the existence or non-existence of genuine
prescience is not our concern). It must also be admitted that the rule
“go home after losing three times” does change the situation, but we
shall rule out such uninteresting systems.

¹ For the reader familiar with general measure theory the situation may be
described as follows. We consider only events which either depend on a finite number
of trials or are limits of monotonic sequences of such events. We calculate the
obvious limits of probabilities and clearly require no measure theory for that
purpose. However, only general measure theory shows that our limits are independent
of the particular passage to the limit and are completely additive.

We define a system as a set of fixed rules which for every trial uniquely
determines whether or not the bettor is to bet; at the $k$th trial the decision
may depend on the outcomes of the first $k - 1$ trials, but not on the outcome
of trials number $k$, $k + 1$, $k + 2$, ...; finally the rules must be such as to
ensure an indefinite continuation of the game. The last condition means
the following. Since the set of rules is fixed, the event “in $n$ trials the
bettor bets more than $r$ times” is well defined and its probability
calculable. It is required that for every $r$, as $n \to \infty$, this probability
tends to 1.

We now formulate our fundamental theorem to the effect that under 
any system the successive bets form a sequence of Bernoulli trials with 
unchanged probability for success. With an appropriate change of 
phrasing this theorem holds for all kinds of independent trials; the 
successive bets form in each case an exact replica of the original trials, 
so that no system can affect the bettor’s fortunes. The importance of 
this statement was first recognized by von Mises, who introduced the 
impossibility of a successful gambling system as a fundamental axiom. 
The present formulation and proof follow Doob.² For simplicity we
assume that p = 1/2. 

Let $A_k$ be the event “first bet occurs at the $k$th trial.” Our definition
of system requires that as $n \to \infty$ the probability tends to one that the
first bet has occurred before the $n$th trial. This means that
$\Pr\{A_1\} + \Pr\{A_2\} + \cdots + \Pr\{A_n\} \to 1$, or

(2.1)    $\sum \Pr\{A_k\} = 1.$

² J. L. Doob, Note on Probability, Annals of Mathematics, vol. 37 (1936), pp.
363-367.




Next, let $B_k$ be the event “head at $k$th trial.” Then the event $B$
“when first bet is made the trial results in heads” is the union of
the events $A_1B_1$, $A_2B_2$, $A_3B_3$, ... which are mutually exclusive.
Now $A_k$ depends only on the outcome of the first $k - 1$ trials,
and $B_k$ only on the trial number $k$. Hence $A_k$ and $B_k$ are independent
and $\Pr\{A_kB_k\} = \Pr\{A_k\}\Pr\{B_k\} = \tfrac{1}{2}\Pr\{A_k\}$. Thus
$\Pr\{B\} = \sum \Pr\{A_kB_k\} = \tfrac{1}{2}\sum \Pr\{A_k\} = 1/2$. This shows that under this
system the probability of heads at the first bet is 1/2, and the same
statement holds for all subsequent bets.

It remains to show that the bets are statistically independent.
This means that the probability that the coin falls heads at both the
first and the second bet should be 1/4 (and similarly for all other
combinations and for the subsequent trials). To verify this statement
let $A_k'$ be the event that the second bet occurs at the $k$th trial. Let $E$
represent the event “heads at the first two bets”; it is the union of all
events $A_jB_jA_k'B_k$ (if $j \ge k$, then $A_j$ and $A_k'$ are mutually exclusive
and $A_jA_k' = 0$). Therefore

(2.2)    $\Pr\{E\} = \sum_{j=1}^{\infty} \sum_{k=1}^{\infty} \Pr\{A_jB_jA_k'B_k\}.$

As before, we see that for fixed $j$ and $k > j$, the event $B_k$ (heads at $k$th
trial) is independent of the event $A_jB_jA_k'$ (which depends only on
the outcomes of the first $k - 1$ trials). Hence

(2.3)    $\Pr\{E\} = \frac{1}{2} \sum_{j=1}^{\infty} \sum_{k=j+1}^{\infty} \Pr\{A_jB_jA_k'\} = \frac{1}{2} \sum_{j=1}^{\infty} \Pr\{A_jB_j\} \sum_{k=j+1}^{\infty} \Pr\{A_k' \mid A_jB_j\}$

[cf. Chapter 5, (1.9)]. Now, whenever the first bet occurs and whatever
its outcome, the game is sure to continue, that is, the second bet occurs
sooner or later. This means that for given $A_jB_j$ with $\Pr\{A_jB_j\} > 0$
the conditional probabilities that the second bet occurs at the $k$th trial
must add to unity. The second series in (2.3) is therefore unity, and
we have already seen that $\sum \Pr\{A_jB_j\} = 1/2$. Hence $\Pr\{E\} = 1/4$
as contended. A similar argument holds for any combination of trials.
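
A simulation cannot prove the theorem, but it illustrates what it asserts. In the sketch below, which uses hypothetical names and a fair coin, a system is a function that sees only the past outcomes and decides whether to bet on the coming trial. The relative frequency of heads among the bets should stay near 1/2, and that of two heads at consecutive bets near 1/4.

```python
import random

def simulate_system(should_bet, n_trials, seed=0):
    """Return the outcomes (True = heads) of the trials on which the system bets."""
    rng = random.Random(seed)
    history, bets = [], []
    for _ in range(n_trials):
        bet = should_bet(history)       # decided before the trial is performed
        outcome = rng.random() < 0.5
        if bet:
            bets.append(outcome)
        history.append(outcome)
    return bets

def after_two_heads(history):
    """Bet only when the last two trials were both heads."""
    return len(history) >= 2 and history[-1] and history[-2]

def every_seventh(history):
    """Bet at trials number 7, 14, 21, ..."""
    return len(history) % 7 == 6

def heads_frequency(bets):
    return sum(bets) / len(bets)

def double_heads_frequency(bets):
    """Frequency of heads at both bets in non-overlapping pairs of successive bets."""
    pairs = [(bets[i], bets[i + 1]) for i in range(0, len(bets) - 1, 2)]
    return sum(1 for a, b in pairs if a and b) / len(pairs)

if __name__ == "__main__":
    for rule in (after_two_heads, every_seventh):
        bets = simulate_system(rule, 200_000, seed=1)
        print(rule.__name__, len(bets), heads_frequency(bets), double_heads_frequency(bets))
```

Both rules satisfy the definition of a system: each decision depends only on earlier outcomes, and the betting goes on indefinitely.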

Note that the situation is different when the player is permitted to 
vary arbitrarily the amounts which he puts down. Under such conditions
there exist advantageous and disadvantageous strategies, and
the game depends on the strategy. We shall return to this point in 
Chapter 14, section 2. 





3. The Borel-Cantelli Lemmas 

Two simple lemmas concerning infinite sequences of trials are used 
so frequently that they deserve special attention. We formulate 
them for Bernoulli trials, but they apply to more general cases. Lemma 
1 is used in section 4; lemma 2 in sections 4 and 5. 

We refer again to an infinite sequence of Bernoulli trials. Let
$A_1$, $A_2$, ... be an infinite sequence of events each of which depends
only on a finite number of trials; in other words, we suppose that there
exists an integer $n_k$ such that $A_k$ is an event in the sample space of
the first $n_k$ Bernoulli trials. Put

(3.1)    $a_k = \Pr\{A_k\}.$

(For example, $A_k$ may stand for the event that trial number $2k$ concludes
a run of at least $k$ consecutive successes. Then $n_k = 2k$ and
$a_k = p^k$.)

For every particular infinite sequence of letters S and F it is possible
to establish whether it belongs to 0, 1, 2, ... or infinitely many among
the $\{A_k\}$. This means that we can speak of the event $U_r$, that an
unending sequence of trials produces more than $r$ among the events
$\{A_k\}$, and also of the event $U_\infty$, that infinitely many among the $\{A_k\}$
occur. The event $U_r$ is defined only in the infinite sample space, and
its probability is the limit of $\Pr\{U_{n,r}\}$, the probability that $n$ trials
produce more than $r$ among the events $\{A_k\}$. Finally, $\Pr\{U_\infty\} =
\lim \Pr\{U_r\}$; this limit exists since $\Pr\{U_r\}$ can only decrease as $r$
increases.

Lemma 1. If $\sum a_k$ converges, then with probability one only finitely
many events $A_k$ occur. More precisely, it is claimed that for $r$ sufficiently
large, $\Pr\{U_r\} < \epsilon$ or: to every $\epsilon > 0$ it is possible to find an integer $r$ such
that the probability that $n$ trials produce one or more among the events
$A_{r+1}$, $A_{r+2}$, ... is less than $\epsilon$ for all $n$.

Proof. Determine $r$ so that $a_{r+1} + a_{r+2} + \cdots < \epsilon$; this is possible
since $\sum a_k$ converges. Without loss of generality we may suppose that
the $A_k$ are ordered in such a way that $n_1 \le n_2 \le n_3 \le \cdots$. Let $N$ be
the last subscript for which $n_N \le n$. Then $A_1, \ldots, A_N$ are defined in
the space of $n$ trials, and the lemma asserts that the probability that
one or more among the events $A_{r+1}$, $A_{r+2}$, ..., $A_N$ occur is less than $\epsilon$.
This is true, since by the fundamental inequality (6.6) of Chapter 1,
we have for this probability

(3.2)    $\Pr\{A_{r+1} \cup A_{r+2} \cup \cdots \cup A_N\} \le a_{r+1} + a_{r+2} + \cdots + a_N < \epsilon,$

as contended.





A satisfactory converse to the lemma is known only for the special
case of mutually independent $A_k$. This situation occurs when the
trials are divided into non-overlapping blocks and $A_k$ depends only
on the trials in the $k$th block (for example, $A_k$ may be the event that
the $k$th thousand of trials produces more than 600 successes).

Lemma 2. If the events $A_k$ are mutually independent, and if $\sum a_k$
diverges, then with probability one infinitely many $A_k$ occur. In
other words, it is claimed that for every $r$ the probability that $n$ trials
produce more than $r$ among the events $\{A_k\}$ tends to 1 as $n \to \infty$.

Proof. As in the proof of lemma 1 let $A_1$, $A_2$, ..., $A_N$ be the
events defined in the sample space of $n$ trials. The probability
that none of them occurs is, because of the assumed independence,
$(1 - a_1)(1 - a_2) \cdots (1 - a_N)$. Now $1 - x \le e^{-x}$ for $0 \le x \le 1$, and
hence $(1 - a_1)(1 - a_2) \cdots (1 - a_N) \le e^{-(a_1 + a_2 + \cdots + a_N)}$; with
increasing $N$ the last quantity tends to zero. We have thus proved that
with probability one at least one among the $\{A_k\}$ occurs.

Next, divide the sequence $\{A_k\}$ into two subsequences $\{A_k'\}$ and
$\{A_k''\}$ so that both series $\sum \Pr\{A_k'\}$ and $\sum \Pr\{A_k''\}$ diverge. Applying
our result to these subsequences we find that, with probability one, at
least one $A_k'$ and one $A_k''$ occur. Therefore there is probability one
that at least two among the $\{A_k\}$ occur. Applying, in turn, this statement
to the sequences $\{A_k'\}$ and $\{A_k''\}$ we find that at least four
among the $\{A_k\}$ are bound to occur, etc.
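
The inequality used in the first step of the proof is easy to inspect numerically. The sketch below uses two sequences that are not taken from the text and were chosen only because their products telescope: $a_k = 1/(k+1)$, whose series diverges, and $a_k = 1/(k+1)^2$, whose series converges.

```python
import math

def prob_none_occurs(a):
    """(1 - a_1)(1 - a_2)...(1 - a_N) for independent events with Pr{A_k} = a_k."""
    prod = 1.0
    for ak in a:
        prod *= 1.0 - ak
    return prod

def exponential_bound(a):
    """exp(-(a_1 + ... + a_N)), the upper bound used in the proof of lemma 2."""
    return math.exp(-sum(a))

if __name__ == "__main__":
    for N in (10, 1_000, 100_000):
        divergent = [1.0 / (k + 1) for k in range(1, N + 1)]
        convergent = [1.0 / (k + 1) ** 2 for k in range(1, N + 1)]
        print(N, prob_none_occurs(divergent), exponential_bound(divergent),
              prob_none_occurs(convergent))
```

For the divergent series the product equals $1/(N+1)$ and tends to zero; for the convergent series it equals $(N+2)/(2N+2)$ and stays above 1/2, so that with positive probability none of the events occurs.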

Example. What is the probability that in a sequence of Bernoulli
trials the pattern SFS appears infinitely often? Let $A_k$ be the event
that the trials number $k$, $k + 1$, and $k + 2$ produce the sequence SFS.
The events $A_k$ are obviously not mutually independent, but the
sequence $A_1$, $A_4$, $A_7$, $A_{10}$, ... contains only mutually independent
events (since no two depend on the outcome of the same trials). Since
$a_k = p^2 q$ is independent of $k$, the series $a_1 + a_4 + a_7 + \cdots$ diverges,
and hence with probability one the pattern SFS occurs infinitely often.
A similar argument obviously applies for arbitrary patterns.
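
As an illustration of the example, the following sketch simulates a sequence of trials and counts how many of the independent events $A_1$, $A_4$, $A_7$, ... occur among the first $n$ trials. The names and the choice $p = 1/2$ are assumptions made for the illustration; the count should grow roughly like $(n/3)\,p^2 q$.

```python
import random

def count_sfs_blocks(n_trials, p=0.5, seed=0):
    """Number of events A_1, A_4, A_7, ... (pattern SFS) occurring in n_trials trials."""
    rng = random.Random(seed)
    trials = [rng.random() < p for _ in range(n_trials)]   # True = S, False = F
    count = 0
    # Index 0 is trial number 1, so the starts 0, 3, 6, ... are trials 1, 4, 7, ...
    for start in range(0, n_trials - 2, 3):
        if trials[start] and not trials[start + 1] and trials[start + 2]:
            count += 1
    return count

if __name__ == "__main__":
    for n in (3_000, 30_000, 300_000):
        print(n, count_sfs_blocks(n, seed=1), n / 3 * 0.125)
```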

4. The Strong Law of Large Numbers 

The intuitive notion of probability is based on the expectation that
the following is true: if the number of successes in the first $n$ trials of a
sequence of Bernoulli trials is $S_n$, then

(4.1)    $\frac{S_n}{n} \to p.$





In the abstract theory this cannot be true for every sequence of trials;
in fact, our sample space contains a point representing the conceptual
possibility of an infinite sequence of uninterrupted successes, and
for it $S_n/n = 1$. However, it is demonstrable that (4.1) holds with
probability one, so that the cases where (4.1) does not hold form a
negligible exception.

Note that we deal with a statement which is much stronger than the
weak law of large numbers [Chapter 7, (3.2)]. The latter says that for
every sufficiently large fixed $n$ the average $S_n/n$ is likely to be near $p$,
but it does not say that $S_n/n$ is bound to stay near $p$ if the number
of trials is increased. It leaves open the possibility that in $n$ additional
trials at least one of the events $S_{n+1}/(n+1) < p - \epsilon$, or $S_{n+2}/(n+2)
< p - \epsilon$, ..., or $S_{2n}/2n < p - \epsilon$, occurs; the probability of this is the
sum of a large number of probabilities of which we know only that
they are individually small. We shall now prove that with probability
one $S_n/n - p$ becomes and remains small.

Strong Law of Large Numbers. For every $\epsilon > 0$ we have probability
one that only finitely many of the events

(4.2)    $\left| \frac{S_n}{n} - p \right| > \epsilon$

occur. This implies that (4.1) holds with probability one. In terms
of finite sample spaces, it is asserted that to every $\epsilon > 0$, $\delta > 0$ there
corresponds an $r$ such that for all $\nu$ the probability of the simultaneous
realization of the $\nu$ inequalities

(4.3)    $\left| \frac{S_{r+k}}{r+k} - p \right| < \epsilon, \qquad k = 1, 2, \ldots, \nu$

is greater than $1 - \delta$.

Proof. We shall prove a much stronger statement. Let $A_k$ be the
event

(4.4)    $\left| \frac{S_k - kp}{(kpq)^{1/2}} \right| > 2(\log k)^{1/2}.$

Then, by formula (5.2) of Chapter 7, applied to each of the two tails,

(4.5)    $\Pr\{A_k\} \sim 2 \cdot \frac{1}{2(2\pi)^{1/2}(\log k)^{1/2}}\, e^{-2\log k} < \frac{1}{k^2},$

and hence $\sum \Pr\{A_k\}$ converges. Thus lemma 1 of the preceding section
ensures that with probability one only finitely many inequalities (4.4)
hold. On the other hand, (4.2) implies

(4.6)    $\left| \frac{S_n - np}{(npq)^{1/2}} \right| > \epsilon \left( \frac{n}{pq} \right)^{1/2},$

and for large $n$ the right side is larger than $2(\log n)^{1/2}$. Hence, the
realization of infinitely many inequalities (4.2) implies the realization
of infinitely many $A_k$ and has therefore probability zero.
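
The finite form (4.3) of the theorem suggests a simple experiment: follow one simulated sequence and record the largest deviation $|S_n/n - p|$ over all $n$ between $r$ and some large bound. The sketch below is an illustration with hypothetical names; a single finite run can of course only suggest, not establish, the limiting statement.

```python
import random

def max_deviation_after(r, n_trials, p=0.5, seed=0):
    """Largest |S_n/n - p| over r <= n <= n_trials along one simulated sequence."""
    rng = random.Random(seed)
    successes = 0
    worst = 0.0
    for n in range(1, n_trials + 1):
        if rng.random() < p:
            successes += 1
        if n >= r:
            worst = max(worst, abs(successes / n - p))
    return worst

if __name__ == "__main__":
    for r in (100, 1_000, 10_000, 100_000):
        print(r, max_deviation_after(r, 200_000, seed=1))
```

With the seed fixed the same sequence is reused for every $r$, so the printed values cannot increase with $r$.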

The strong law of large numbers was first formulated by Cantelli 
(1917), after Borel and Hausdorff had discussed certain special cases. 
Like the weak law, it is only a very special case of a general theorem 
on random variables. Taken in conjunction with our theorem on the 
impossibility of gambling systems, the law of large numbers implies 
the existence of the limit (4.1) not only for the original sequence of 
trials but also for all subsequences obtained in accordance with the 
rules of section 2. Thus the two theorems together describe the fundamental
properties of randomness which are inherent in the intuitive
notion of probability and whose importance was stressed with special
emphasis by von Mises.


5. The Law of the Iterated Logarithm 

As in Chapter 7 let us again introduce the reduced number of successes
in $n$ trials

(5.1)    $S_n^* = \frac{S_n - np}{(npq)^{1/2}}.$

The Laplace limit theorem asserts that $\Pr\{S_n^* > x\} \sim 1 - \Phi(x)$.
Thus, for every particular value of $n$ it is improbable to have a large
$S_n^*$, but it is intuitively clear that in a prolonged sequence of
trials $S_n^*$ will sooner or later take on arbitrarily large values. Moderate
values of $S_n^*$ are most probable, but the maxima will slowly increase.
How fast? In the course of the proof of the strong law of large numbers
we have seen that with probability one

(5.2)    $S_n^* < 2(\log n)^{1/2}$

for all sufficiently large $n$: this gives us an upper bound for the fluctuations
of $S_n^*$. More information is contained in the following remarkable
theorem discovered by Khintchine.³

³ A. Khintchine, Über einen Satz der Wahrscheinlichkeitsrechnung, Fundamenta
Mathematicae, vol. 6 (1924), pp. 9-20. The discovery was preceded by partial
results due to other authors. The present proof is arranged so as to permit
straightforward generalization to more general random variables.





Theorem. With probability one we have

(5.3)    $\limsup \frac{S_n^*}{(2 \log\log n)^{1/2}} = 1.$

This means: if $\lambda > 1$, then there is probability one that only finitely
many of the events

(5.4)    $S_n > np + \lambda(2npq \log\log n)^{1/2}$

occur; if $\lambda < 1$, then there is probability one that (5.4) holds for infinitely
many $n$.

For reasons of symmetry equation (5.3) implies that

(5.3a)    $\liminf \frac{S_n^*}{(2 \log\log n)^{1/2}} = -1.$
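
To get a feeling for the scale involved, one may follow the ratio $S_n^*/(2\log\log n)^{1/2}$ along a simulated sequence. The sketch below (illustrative names; the starting index 1000 is an arbitrary choice that keeps $\log\log n$ positive) records the extreme values of the ratio. Since $\log\log n$ grows extremely slowly, no feasible simulation can exhibit the limits $\pm 1$ themselves; the run only shows that the ratio stays of moderate size.

```python
import math
import random

def lil_ratio_extremes(n_trials, p=0.5, seed=0, start=1000):
    """Smallest and largest value of S_n* / (2 log log n)^(1/2) for start <= n <= n_trials."""
    rng = random.Random(seed)
    q = 1.0 - p
    successes = 0
    lowest, highest = math.inf, -math.inf
    for n in range(1, n_trials + 1):
        if rng.random() < p:
            successes += 1
        if n >= start:
            scale = math.sqrt(2.0 * n * p * q * math.log(math.log(n)))
            ratio = (successes - n * p) / scale
            lowest = min(lowest, ratio)
            highest = max(highest, ratio)
    return lowest, highest

if __name__ == "__main__":
    for seed in range(3):
        print(seed, lil_ratio_extremes(200_000, seed=seed))
```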


Proof. We start with two preliminary remarks concerning Bernoulli
trials.

(1) There exists a constant $c > 0$ which depends on $p$, but not on $n$,
such that

(5.5)    $\Pr\{S_n > np\} > c$

for all $n$. In fact, an inspection of the binomial distribution shows that
the left side in (5.5) is never zero, and the Laplace limit theorem shows
that it tends to $\tfrac{1}{2}$ as $n \to \infty$. Accordingly, the left side is bounded
away from zero, as asserted.

(2) We require the following lemma: Let x be fixed, and let A be 
the event that for at least one k with k < n 

(5.6) Sk — kp > x. 

Then 

(5.7) Pr{A} <-Pr{S n -np>x }. 

c 

For a proof of the lemma let $A_\nu$ be the event that (5.6) holds for
$k = \nu$ but not for $k = 1, 2, \ldots, \nu - 1$ (here $1 \le \nu \le n$). The events
$A_1, A_2, \ldots, A_n$ are mutually exclusive, and $A$ is their union. Hence

(5.8)    $\Pr\{A\} = \Pr\{A_1\} + \cdots + \Pr\{A_n\}.$

Next, for $\nu < n$ let $U_\nu$ be the event that the total number of successes
in the trials number $\nu + 1, \nu + 2, \ldots, n$ exceeds $(n - \nu)p$. If both
$A_\nu$ and $U_\nu$ occur, then $S_n > S_\nu + (n - \nu)p > np + x$, and since the
$A_\nu U_\nu$ are mutually exclusive, this implies

(5.9)    $\Pr\{S_n - np > x\} \ge \Pr\{A_1 U_1\} + \Pr\{A_2 U_2\} + \cdots + \Pr\{A_{n-1} U_{n-1}\} + \Pr\{A_n\}.$

Now $A_\nu$ depends only on the first $\nu$ trials, and $U_\nu$ only on the following
$n - \nu$ trials. Hence $A_\nu$ and $U_\nu$ are independent, and
$\Pr\{A_\nu U_\nu\} = \Pr\{A_\nu\} \Pr\{U_\nu\}$. From the preliminary remark (5.5) we know
that $\Pr\{U_\nu\} > c$, and since $c < 1$, we get from (5.9) and (5.8)

(5.10)    $\Pr\{S_n - np > x\} \ge c \sum_{\nu} \Pr\{A_\nu\} = c \Pr\{A\}.$

This proves (5.7).
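For small $n$ the inequality (5.7) can be checked by exact computation. The sketch below is an illustration under stated assumptions: the helper names `tail_gt` and `prob_A` and the values of $n$, $p$, $x$ are hypothetical, and the constant $c$ is replaced by the smallest value of $\Pr\{S_m > mp\}$ for $m < n$, which is all that the proof uses for a fixed $n$. Rational arithmetic avoids rounding at the boundary of the strict inequalities.

```python
from fractions import Fraction
from math import comb

def tail_gt(n, p, t):
    """Pr{S_n > t} for the binomial distribution b(k; n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1) if k > t)

def prob_A(n, p, x):
    """Pr{A}: S_k - k p > x for at least one k <= n (dynamic programming)."""
    alive = {0: Fraction(1)}   # successes so far -> probability, (5.6) not yet satisfied
    hit = Fraction(0)
    for k in range(1, n + 1):
        nxt = {}
        for s, pr in alive.items():
            for ds, w in ((1, p), (0, 1 - p)):
                if s + ds - k * p > x:
                    hit += pr * w      # (5.6) holds for the first time at trial k
                else:
                    nxt[s + ds] = nxt.get(s + ds, Fraction(0)) + pr * w
        alive = nxt
    return hit

n, p, x = 30, Fraction(3, 10), 2                     # illustrative values
c = min(tail_gt(m, p, m * p) for m in range(1, n))   # lower bound for Pr{U_v}
lhs = prob_A(n, p, x)
rhs = tail_gt(n, p, n * p + x) / c
print(float(lhs), float(rhs))
```

The first printed number is $\Pr\{A\}$ and the second is the bound on the right side of (5.7) with the finite-$n$ substitute for $c$.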

(3) We now prove the part of the theorem relating to (5.4) with
$\lambda > 1$. Let $\gamma$ be a number such that

(5.11)    $1 < \gamma < \lambda^{1/2},$

and let $n_r$ be the integer nearest to $\gamma^r$ $(r = 1, 2, \ldots)$. Let $B_r$ be the
event that the inequality

(5.12)    $S_n - np > \lambda (2 n_r pq \log\log n_r)^{1/2}$

holds for at least one $n$ with $n_r \le n < n_{r+1}$. Obviously (5.4) can hold
for infinitely many $n$ only if infinitely many $B_r$ occur. Using the first
Borel-Cantelli lemma, we see therefore that it suffices to prove that

(5.13)    $\sum \Pr\{B_r\}$ converges.


By the lemma

(5.14)    $\Pr\{B_r\} \le c^{-1} \Pr\{S_{n_{r+1}} - n_{r+1} p > \lambda (2 n_r pq \log\log n_r)^{1/2}\} = c^{-1} \Pr\left\{S^*_{n_{r+1}} > \lambda \left(2 \frac{n_r}{n_{r+1}} \log\log n_r\right)^{1/2}\right\}.$

Now $n_r / n_{r+1} \sim \gamma^{-1} > \lambda^{-1/2}$, and hence for sufficiently large $r$

(5.15)    $\Pr\{B_r\} \le c^{-1} \Pr\{S^*_{n_{r+1}} > (2\lambda \log\log n_r)^{1/2}\}.$

From the DeMoivre-Laplace limit theorem [formula (5.2) of Chapter
7] we get, therefore, for large $r$

(5.16)    $\Pr\{B_r\} < \frac{1}{c} e^{-\lambda \log\log n_r} = \frac{1}{c (\log n_r)^{\lambda}} \sim \frac{1}{c (r \log \gamma)^{\lambda}}.$

Since $\lambda > 1$, the assertion (5.13) is proved.




(4) Finally, we prove the assertion concerning (5.4) with $\lambda < 1$.
This time we choose for $\gamma$ an integer so large that

(5.17)    $\frac{\gamma - 1}{\gamma} > \eta > \lambda,$

where $\eta$ is a constant to be determined later. Put

(5.18)    $n_r = \gamma^r.$

The second Borel-Cantelli lemma applies only to independent events.
For this reason we introduce

(5.19)    $D_r = S_{n_r} - S_{n_{r-1}};$

$D_r$ is the total number of successes following trial number $n_{r-1}$ and
up to and including trial $n_r$; for it we have the binomial distribution
$b(k; n, p)$ with $n = n_r - n_{r-1}$. Let $A_r$ be the event

(5.20)    $D_r - (n_r - n_{r-1}) p > \eta (2pq\, n_r \log\log n_r)^{1/2}.$

We claim that with probability one infinitely many $A_r$ occur. Since the
various $A_r$ depend on non-overlapping blocks of trials (namely,
$n_{r-1} < n \le n_r$), they are mutually independent, and, according to
the second Borel-Cantelli lemma, it suffices to prove that

(5.21)    $\sum \Pr\{A_r\}$ diverges.

Now

(5.22)    $\Pr\{A_r\} = \Pr\left\{ \frac{D_r - (n_r - n_{r-1}) p}{\{(n_r - n_{r-1}) pq\}^{1/2}} > \eta \left(2 \frac{n_r}{n_r - n_{r-1}} \log\log n_r\right)^{1/2} \right\}.$

Here $n_r/(n_r - n_{r-1}) = \gamma/(\gamma - 1) < \eta^{-1}$, by (5.17). Hence

(5.23)    $\Pr\{A_r\} \ge \Pr\left\{ \frac{D_r - (n_r - n_{r-1}) p}{\{(n_r - n_{r-1}) pq\}^{1/2}} > (2\eta \log\log n_r)^{1/2} \right\}.$

Using again the estimate (5.2) of Chapter 7, we find for large $r$

(5.24)    $\Pr\{A_r\} > \frac{1}{2\eta \log\log n_r} e^{-\eta \log\log n_r} = \frac{1}{2\eta (\log\log n_r)(\log n_r)^{\eta}}.$

Since $n_r = \gamma^r$ and $\eta < 1$, we find that for large $r$ we have $\Pr\{A_r\} > 1/r$,
which proves (5.21).

The last step of the proof consists in showing that $S_{n_{r-1}}$ in (5.19) can
be neglected. From the first part of the theorem, which has already
been proved, we know that to every $\epsilon > 0$ we can find an $N$ so that,
with probability $1 - \epsilon$ or better, for all $r > N$,

(5.25)    $|S_{n_{r-1}} - n_{r-1} p| < 2 (2pq\, n_{r-1} \log\log n_{r-1})^{1/2}.$

Now suppose that $\eta$ is chosen so close to 1 that

(5.26)    $1 - \eta < \left(\frac{\eta - \lambda}{2}\right)^2.$

Then from (5.17)

(5.27)    $4 n_{r-1} = 4 \frac{n_r}{\gamma} < n_r (\eta - \lambda)^2$

and hence (5.25) implies

(5.28)    $S_{n_{r-1}} - n_{r-1} p > -(\eta - \lambda)(2pq\, n_r \log\log n_r)^{1/2}.$

Adding (5.28) to (5.20), we obtain (5.4) with $n = n_r$. It follows that,
with probability $1 - \epsilon$ or better, this inequality holds for infinitely
many $r$, and this accomplishes the proof.

The law of the iterated logarithm for Bernoulli trials is a special
case of a more general theorem first formulated by Kolmogorov.⁴ It
is now possible to formulate stronger theorems (cf. problems 5 and 6).

6. Interpretation in Number Theory Language 

Let $x$ be a real number in the interval $0 \le x < 1$, and let

(6.1)    $x = 0.a_1 a_2 a_3 \cdots$

be its decimal expansion (so that each $a_j$ stands for one of the digits
0, 1, ..., 9). This expansion is unique except for numbers of the form
$a/10^n$, which can be written either by means of a terminating expansion
(containing infinitely many zeros) or by means of an expansion containing
infinitely many nines. To avoid ambiguities we now agree not
to use the latter form.

The decimal expansions are connected with Bernoulli trials with
p = 1/10, the digit 0 representing success and all other digits failure.
If we replace in (6.1) all zeros by the letter S and all other digits by F,
then (6.1) represents a possible outcome of an infinite sequence of
Bernoulli trials with p = 1/10. Conversely, an arbitrary sequence of
letters S and F can be obtained in the described manner from the expansion
of certain numbers x. In this way every event in the sample space
of Bernoulli trials is represented by a certain aggregate of numbers x.

⁴ A. Kolmogoroff, Das Gesetz des iterierten Logarithmus, Mathematische Annalen,
vol. 101 (1929), pp. 126-135.





For example, the event “success at the nth trial” is represented by all
those x whose nth decimal is zero. This is an aggregate of $10^{n-1}$
intervals each of length $10^{-n}$, and the total length of these intervals
equals 1/10, which is the probability of our event. Every particular
finite sample sequence of length n corresponds to an aggregate of certain
intervals; for example, the sequence SFS is represented by the nine
intervals $0.01 \le x < 0.011$, $0.02 \le x < 0.021$, ..., $0.09 \le x < 0.091$.
The probability of each such sample sequence equals the total length
of the corresponding intervals on the x-axis. Probabilities of more
complicated events are always expressed in terms of probabilities of
finite sample sequences, and the calculation proceeds according to the
same addition rule which is valid for the usual Lebesgue measure on
the x-axis. Accordingly, our probabilities will always coincide with
the measure of the corresponding aggregate of points on the x-axis.
We have thus a means of translating all limit theorems for Bernoulli
trials with p = 1/10 into theorems concerning decimal expansions.
The phrase “with probability one” is equivalent to “for almost all x” or
“almost everywhere.”

We have considered the random variable $S_n$ which gives the number
of successes in n trials. Here it is more convenient to emphasize the
fact that $S_n$ is a function of the sample point, and we write $S_n(x)$ for
the number of zeros among the first n decimals of x. Obviously $S_n(x)$ is
a function of x whose graph is a step polygon whose discontinuities are
necessarily points of the form $a/10^n$, where a is an integer. The ratio
$S_n(x)/n$ is called the frequency of zeros among the first n decimals of x.
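The correspondence between sample sequences and intervals can be made concrete by direct enumeration. The following is a minimal sketch; the function names `measure_of_pattern` and `zeros_among_first_decimals` are hypothetical, and a number x is represented simply by the string of its decimals.

```python
from fractions import Fraction
from itertools import product

def measure_of_pattern(pattern):
    """Total length of the intervals of x in [0, 1) whose first len(pattern)
    decimals show the given S/F pattern (digit 0 = S, any other digit = F)."""
    n = len(pattern)
    total = Fraction(0)
    for digits in product(range(10), repeat=n):
        word = "".join("S" if d == 0 else "F" for d in digits)
        if word == pattern:
            total += Fraction(1, 10**n)   # one interval of length 10^(-n)
    return total

def zeros_among_first_decimals(x_digits, n):
    """S_n(x): number of zeros among the first n decimals of x,
    with x given as a string of decimal digits after the point."""
    return x_digits[:n].count("0")

print(measure_of_pattern("SFS"))                    # nine intervals of length 1/1000
print(zeros_among_first_decimals("0503001", 5))     # decimals 0, 5, 0, 3, 0
```

The first value agrees with the probability (1/10)(9/10)(1/10) of the sample sequence SFS in Bernoulli trials with p = 1/10.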

In the language of ordinary measure theory the weak law of large
numbers asserts that $S_n(x)/n \to 1/10$ in measure, while the strong
law states that $S_n(x)/n \to 1/10$ almost everywhere. Khintchine’s law
of the iterated logarithm shows that

(6.2)    $\limsup_{n \to \infty} \frac{S_n(x) - n/10}{(n \log\log n)^{1/2}} = (0.3)\, 2^{1/2}$

for almost all x. It gives an answer to a problem treated in a series
of papers initiated by Hausdorff⁵ (1913) and Hardy and Littlewood⁶
(1914). For a further improvement of this result see problems 5 and 6.

Instead of the digit zero we may consider any other digit and can
formulate the strong law of large numbers to the effect that the frequency
of each of the ten digits tends to 1/10 for almost all x. A similar
theorem holds if the base 10 of the decimal system is replaced by any
other base. This fact was discovered by Borel (1909) and is usually
expressed by saying that almost all numbers are “normal.”

⁵ F. Hausdorff, Grundzüge der Mengenlehre, Leipzig, 1913.

⁶ Hardy and Littlewood, Some Problems of Diophantine Approximation, Acta
Mathematica, vol. 37 (1914), pp. 155-239.


7. Problems for Solution 

1. Find an integer $\beta$ such that in rolling dice there are about even chances that
a run of three consecutive aces appears before a non-ace run of length $\beta$.

2. Consider repeated independent trials with three possible outcomes A, B, C
and corresponding probabilities p, q, r (p + q + r = 1). Find the probability
that a run of $\alpha$ consecutive A’s will occur before a B-run of length $\beta$.

3. Continuation. Find the probability that an A-run of length $\alpha$ will occur
before either a B-run of length $\beta$ or a C-run of length $\gamma$.

4. In a sequence of Bernoulli trials let $A_n$ be the event that a run of n consecutive
successes occurs between the $2^n$th and the $2^{n+1}$th trial. If $p \ge 1/2$, there is
probability one that infinitely many $A_n$ occur; if $p < 1/2$, then with probability
one only finitely many $A_n$ occur.

5. Let $\phi(t)$ be a positive monotonically increasing function, and let $n_r$ be the
nearest integer to $e^{r/\log r}$. If

(7.1)    $\sum_r \frac{1}{\phi(n_r)} e^{-\frac{1}{2} \phi^2(n_r)}$

converges, then with probability one, the inequality

(7.2)    $S_n > np + (npq)^{1/2} \phi(n)$

takes place only for finitely many n. [Note that without loss of generality we may
suppose that $\phi(n) < 10 (\log\log n)^{1/2}$; the law of the iterated logarithm takes care
of the larger $\phi(n)$.]

6. Prove⁷ that the series (7.1) converges if, and only if,

(7.3)    $\sum_n \frac{\phi(n)}{n} e^{-\frac{1}{2} \phi^2(n)}$

converges. [Hint: Collect the terms $n_{r-1} < n \le n_r$ and note that $n_r - n_{r-1}
\sim n_r(1 - 1/\log r)$; furthermore, (7.3) can converge only if $\phi^2(n) > 2 \log\log n$.]

⁷ Problems 5 and 6 together show that in case of convergence of (7.3) the inequality
(7.2) holds with probability one only for finitely many n. Conversely, if
(7.3) diverges, the inequality (7.2) holds with probability one for infinitely many n.
This converse is much more difficult to prove; cf. W. Feller, The General Form of the
So-called Law of the Iterated Logarithm, Transactions of the American Mathematical
Society, vol. 54 (1943), pp. 373-402, where more general theorems are proved for
arbitrary random variables. For the special case of Bernoulli trials with p = 1/2
cf. P. Erdös, On the Law of the Iterated Logarithm, Annals of Mathematics (2),
vol. 43 (1942), pp. 419-436. The law of the iterated logarithm follows for the
particular case $\phi(t) = \lambda (2 \log\log t)^{1/2}$.



CHAPTER 9 


RANDOM VARIABLES; EXPECTATION 


1. Random Variables 


According to the definition given in calculus textbooks, the quantity
y is called a function of the real number x if to every x there corresponds
a value y. This definition can be extended to cases where the independent
variable is not a real number. Thus we call the distance a
function of a pair of points; the perimeter of a triangle is a function
defined on the set of triangles; a sequence $a_n$ is a function defined for
all positive integers; the binomial coefficient $\binom{x}{k}$ is a function defined
for pairs of numbers (x, k) of which the second is a non-negative
integer. In the same sense we can say that the number $S_n$ of successes
in n Bernoulli trials is a function defined on the sample space; to each
of the $2^n$ points in this space there corresponds a number $S_n$.

A function defined on a sample space is called a random variable. 
Throughout the preceding chapters we have been concerned with 
random variables without using this term. Typical random variables 
are the number of aces in a hand at bridge, of multiple birthdays in a 
company of n people, of success runs in n Bernoulli trials. In each 
case there is a unique rule which associates a number X with any 
sample point. The classical theory of probability was devoted mainly 
to a study of the gambler’s gain, which is again a random variable; in 
fact, every random variable can be interpreted as the gain of a real or 
imaginary gambler in a suitable game. The position of a particle under 
diffusion, the energy, temperature, etc., of physical systems are random 
variables; but they are defined in non-discrete sample spaces, and their 
study is therefore deferred. In the case of a discrete sample space we 
can actually tabulate any random variable X by enumerating in some 
order all points of the space and associating with each the corresponding
value of X.

The term random variable is somewhat confusing, and random
function would be more appropriate (the independent variable being a
point in sample space, i.e., outcome of an experiment). The conceptual
confusion is increased by the tradition of denoting random variables
by single letters without referring to sample points; the same custom
now prevails in many branches of mathematics, however.

Let X be a random variable and let $x_1, x_2, x_3, \ldots$ be the values which
it assumes (in most of what follows, the $x_j$ will be integers). In general
the same value $x_j$ may correspond to several sample points. Their
aggregate forms the event that $X = x_j$; its probability is denoted by
$\Pr\{X = x_j\}$. The system of relations

(1.1)    $\Pr\{X = x_j\} = f(x_j)$    $(j = 1, 2, \ldots)$

defines the (probability) distribution¹ of the random variable X. Clearly

(1.2)    $f(x_j) \ge 0, \quad \sum f(x_j) = 1.$

If a value x is never assumed, we agree to write $\Pr\{X = x\} = 0$.
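The passage from a sample space to the distribution (1.1) amounts to collecting sample points with equal values of X. The following sketch shows this for the number of successes in n Bernoulli trials; the helper name `distribution` and the values n = 4, p = 1/3 are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product
from math import comb

def distribution(sample_space, X):
    """Collect Pr{X = x_j} = f(x_j) by adding the probabilities of all
    sample points on which the function X takes the value x_j."""
    f = {}
    for point, prob in sample_space:
        f[X(point)] = f.get(X(point), Fraction(0)) + prob
    return f

n, p = 4, Fraction(1, 3)
q = 1 - p
# Sample space of n Bernoulli trials: 2^n points, each a tuple of 0s and 1s.
space = [(pt, p**sum(pt) * q**(n - sum(pt))) for pt in product((0, 1), repeat=n)]

f = distribution(space, sum)          # X = S_n, the number of successes
for k in sorted(f):
    print(k, f[k], comb(n, k) * p**k * q**(n - k))
```

Each printed line shows a value k, the probability obtained by enumeration, and the binomial term b(k; n, p) for comparison.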

Examples. The number $S_n$ of successes in n Bernoulli trials is a
random variable whose distribution is the binomial distribution
b(k; n, p). Another example of a random variable is the number X of
aces in a random sample of r bridge cards. The possible values of X
are 0, 1, 2, 3, 4, and the corresponding probabilities are given by formula
(5.3) and Table 3 of Chapter 2. Each line of this table represents a
probability distribution.

¹ For a discrete variable X the probability distribution is the function $f(x_j)$
defined on the aggregate of possible values $x_j$ of X. This term must be distinguished
from the term “distribution function,” which applies to any non-decreasing function
F(x) which tends to 0 as $x \to -\infty$ and to 1 as $x \to \infty$. The distribution
function F(x) of X is defined by

$F(x) = \Pr\{X \le x\} = \sum_{x_j \le x} f(x_j),$

the last sum extending over all those $x_j$ which do not exceed x. Thus the distribution
function of a variable can be calculated from its probability distribution and
vice versa. In this volume we shall not be concerned with distribution functions.

Consider now two random variables X and Y defined on the same
sample space, and denote the values which they assume, respectively,
by $x_1, x_2, \ldots$, and $y_1, y_2, \ldots$; let the corresponding probability
distributions be $\{f(x_j)\}$ and $\{g(y_k)\}$. The aggregate of points in which
the two conditions $X = x_j$ and $Y = y_k$ are satisfied forms an event
whose probability will be denoted by $\Pr\{X = x_j, Y = y_k\}$. The system
of equations

(1.3)    $\Pr\{X = x_j, Y = y_k\} = p(x_j, y_k)$    $(j, k = 1, 2, \ldots)$

is called the joint probability distribution of X and Y. It is best exhibited
in the form of a double-entry table as exemplified in Table 1. (Note,
however, that the numbers of rows and columns need not be equal.)
Clearly

(1.4)    $p(x_j, y_k) \ge 0, \quad \sum_{j,k} p(x_j, y_k) = 1.$

Moreover, for every fixed j

(1.5)    $p(x_j, y_1) + p(x_j, y_2) + p(x_j, y_3) + \cdots = \Pr\{X = x_j\} = f(x_j)$

and for every fixed k

(1.6)    $p(x_1, y_k) + p(x_2, y_k) + p(x_3, y_k) + \cdots = \Pr\{Y = y_k\} = g(y_k).$

In other words, by adding the probabilities in individual rows and
columns, we obtain the probability distributions of X and Y. They
may be exhibited as shown in Table 1 and are then called marginal
distributions. The adjective “marginal” refers to the outer appearance
in the double-entry table and is also used for stylistic clarity when the
joint distribution of two variables and also their individual (marginal)
distributions appear in the same context. Strictly speaking, the
adjective “marginal” is always redundant.


TABLE 1

Joint Distribution of the Numbers of Aces in Two Specified Hands at Bridge

                                     X                                 Marginal
   Y           0           1           2           3           4      Distribution
                                                                         for Y

   0       0.055 222   0.124 850   0.093 637   0.027 467   0.002 641   0.303 817
   1        .124 850    .202 881    .097 383    .013 734   0            .438 848
   2        .093 637    .097 383    .022 473   0           0            .213 493
   3        .027 467    .013 734   0           0           0            .041 201
   4        .002 641   0           0           0           0            .002 641

Marginal
Distribution
for X       .303 817    .438 848    .213 493    .041 201    .002 641   1

E(X) = E(Y) = 1; Var(X) = Var(Y) = 0.705 885..., Cov(X, Y) = -0.235 295..., ρ(X, Y) = -1/3.

The notion of joint distribution carries over to systems of more 
than two random variables. 
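The entries of Table 1 can be recomputed by counting. The sketch below rests on an assumption not spelled out in this section: the first hand is taken as a random sample of 13 cards from the deck of 52 (4 aces), and the second hand as a random sample of 13 from the remaining 39 cards. The function name `joint` is hypothetical. The marginal distributions are then obtained by adding rows and columns as in (1.5) and (1.6).

```python
from fractions import Fraction
from math import comb

def joint(j, k):
    """Pr{X = j, Y = k}: j aces in the first hand, k aces in the second hand,
    both hands holding 13 cards from a deck of 52 with 4 aces (assumed model)."""
    first = Fraction(comb(4, j) * comb(48, 13 - j), comb(52, 13))
    second = Fraction(comb(4 - j, k) * comb(35 + j, 13 - k), comb(39, 13))
    return first * second

p = {(j, k): joint(j, k) for j in range(5) for k in range(5)}
f = {j: sum(p[j, k] for k in range(5)) for j in range(5)}   # marginal of X, as in (1.5)
g = {k: sum(p[j, k] for j in range(5)) for k in range(5)}   # marginal of Y, as in (1.6)

for k in range(5):                                          # one printed row per value of Y
    print(" ".join(f"{float(p[j, k]):.6f}" for j in range(5)), f"| {float(g[k]):.6f}")
print(" ".join(f"{float(f[j]):.6f}" for j in range(5)))
```

The printed rows can be compared with Table 1; combinations with more than four aces in all receive probability zero automatically because the corresponding binomial coefficient vanishes.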


Examples. [For numerical examples of joint distributions cf. Table 1
and the answer to problem 1.]

(a) Equation (5.8) of Chapter 2 gives the probability that a sample
of size r contains $X = k_1$ elements of the first class and $Y = k_2$ elements
of the second class, that is, the joint distribution of X and Y.
We get the marginal distribution $\Pr\{X = k_1\}$ if we keep $k_1$ fixed
and add over $k_2 = 0, 1, \ldots, r$. In view of (5.7) of Chapter 2 this
operation leads to the ordinary hypergeometric distribution.

(b) In n throws of a die let X, Y, Z, U, V, W be the numbers of
ones, twos, ..., sixes. The joint distribution of these six variables
follows from the multinomial distribution (7.2) of Chapter 6. It is
given by

(1.7)    $p(k_1, k_2, k_3, k_4, k_5, k_6) = \frac{n!}{k_1!\, k_2!\, k_3!\, k_4!\, k_5!\, k_6!}\, 6^{-n}$

provided that $k_1 + k_2 + \cdots + k_6 = n$; for every other combination
the probability is zero. For given X, ..., V there is only one possible
value for W. Keeping $k_1, k_2, k_3, k_4$ fixed and adding over
$k_5 = 0, 1, 2, \ldots, (n - k_1 - k_2 - k_3 - k_4)$, we get the joint distribution
of the four variables X, Y, Z, U, or

(1.8)    $p(k_1, k_2, k_3, k_4) = \frac{n!\; 2^{\,n - k_1 - k_2 - k_3 - k_4}}{k_1!\, k_2!\, k_3!\, k_4!\, (n - k_1 - k_2 - k_3 - k_4)!}\, 6^{-n}.$

Adding over $k_4 = 0, 1, \ldots, n - k_1 - k_2 - k_3$, we get the joint distribution
of X, Y, Z,

(1.9)    $p(k_1, k_2, k_3) = \frac{n!\; 3^{\,n - k_1 - k_2 - k_3}}{k_1!\, k_2!\, k_3!\, (n - k_1 - k_2 - k_3)!}\, 6^{-n}.$

Proceeding in this way, we arrive at the binomial distribution
$b(k_1; n, 1/6)$ for X alone. In this case all six variables have a common
(marginal) distribution.
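The summation leading from (1.7) to (1.8) can be verified for particular values. This is a small check in exact arithmetic; the function names `p6` and `p4` and the chosen values of n and $k_1, \ldots, k_4$ are illustrative assumptions.

```python
from fractions import Fraction
from math import factorial

def p6(ks, n):
    """(1.7): joint distribution of the six face counts in n throws of a die."""
    if sum(ks) != n:
        return Fraction(0)
    denom = 1
    for k in ks:
        denom *= factorial(k)
    return Fraction(factorial(n), denom * 6**n)

def p4(k1, k2, k3, k4, n):
    """(1.8): joint distribution of X, Y, Z, U."""
    rest = n - k1 - k2 - k3 - k4
    if rest < 0:
        return Fraction(0)
    denom = (factorial(k1) * factorial(k2) * factorial(k3)
             * factorial(k4) * factorial(rest))
    return Fraction(factorial(n) * 2**rest, denom * 6**n)

n = 5
k1, k2, k3, k4 = 1, 0, 2, 1
rest = n - k1 - k2 - k3 - k4
# Add (1.7) over k5; k6 is then determined by k1 + ... + k6 = n.
summed = sum(p6((k1, k2, k3, k4, k5, rest - k5), n) for k5 in range(rest + 1))
print(summed, p4(k1, k2, k3, k4, n))
```

Both printed fractions should coincide, since the sum over $k_5$ of $1/(k_5!\,k_6!)$ with $k_5 + k_6$ fixed produces the factor $2^{\,n - k_1 - k_2 - k_3 - k_4}$ in (1.8).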

(c) Sampling Inspection. As in Chapter 6, example (7.c), suppose
that items are subjected to sampling inspection. We have a double
classification: an item is acceptable or defective, and it is or is not
inspected. The corresponding probabilities are p, q, and p', q', respectively
(p + q = 1, p' + q' = 1). We are concerned with double
Bernoulli trials in which the four classes (“acceptable and inspected,”
etc.) have probabilities pp', pq', qp', qq'. Suppose now (as in Dodge’s
sampling plan) that items are sampled until the first defective item is
discovered, and consider the following two random variables: the
number N of items passing the inspection desk before this discovery,
and the number K of defectives among them (which have passed undiscovered).
The event N = n, K = k occurs if k out of n items are
defective but not inspected, the remaining n - k items are acceptable,
and the (n + 1)st item is both defective and inspected. Therefore the
joint distribution of N and K is given by

(1.10)    $p(n, k) = \Pr\{N = n, K = k\} = \binom{n}{k} p^{n-k} (qq')^k \cdot qp',$

where $n \ge 0$, $0 \le k \le n$ (if k > n the binomial coefficient vanishes).
Summing over all k, we get from the binomial formula

(1.11)    $f(n) = (p + qq')^n qp' = (1 - qp')^n qp';$

this is the probability of the event N = n, that n items pass control
before the first defective item is discovered. Summing over all $n \ge k$,
we get [using formula (9.1) of Chapter 2 and the binomial formula]

(1.12)    $g(k) = (qq')^k qp' \sum_{\nu=0}^{\infty} \binom{-k-1}{\nu} (-p)^{\nu} = (qq')^k qp' (1 - p)^{-k-1} = p' q'^k.$

Of course, the marginal distributions (1.11) and (1.12) could be derived
directly. (To be continued in problem 15.)
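The two marginal distributions of this example can be checked against the joint distribution (1.10) by direct summation. The sketch below uses illustrative numerical values for p, q, p', q' (these values, the variable names `pp` and `qq` for p' and q', and the truncation point `n_max` are assumptions); the sum over k is finite and checked exactly, while the sum over n is an infinite series and is only truncated.

```python
from fractions import Fraction
from math import comb

p, q = Fraction(9, 10), Fraction(1, 10)     # acceptable / defective (illustrative values)
pp, qq = Fraction(1, 4), Fraction(3, 4)     # p' = inspected, q' = not inspected

def joint(n, k):
    """(1.10): Pr{N = n, K = k}."""
    return comb(n, k) * p**(n - k) * (q * qq)**k * q * pp

def f(n):
    """(1.11): Pr{N = n}."""
    return (1 - q * pp)**n * q * pp

def g(k):
    """(1.12): Pr{K = k}, the value p' q'^k."""
    return pp * qq**k

n = 6
print(sum(joint(n, k) for k in range(n + 1)) == f(n))      # exact check of (1.11)

k, n_max = 2, 400                                          # truncated check of (1.12)
partial = sum(float(joint(m, k)) for m in range(k, n_max))
print(partial, float(g(k)))
```

The geometric form of g(k) is what a direct argument would suggest: each defective item escapes inspection with probability q', and the count stops at the first defective that is inspected.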


With the notation (1.3) the conditional probability of the event
$Y = y_k$, given that $X = x_j$, becomes

(1.13)    $\Pr\{Y = y_k \mid X = x_j\} = \frac{p(x_j, y_k)}{f(x_j)}.$

As a glance at Table 1 shows, this conditional probability is in general
different from $g(y_k)$. This indicates that inference can be drawn from
the values of X to those of Y and vice versa; the two variables are
(statistically) dependent. The strongest degree of dependence exists
when Y is a function of X, that is, when the value of X uniquely determines
Y. For example, if a coin is tossed n times and X and Y are
the numbers of heads and tails, then $Y = n - X$. Similarly, when
$Y = X^2$, we can compute Y from X. In the joint distribution this
means that in each row all entries but one are zero. If, on the other
hand, $p(x_j, y_k) = f(x_j) g(y_k)$ for all combinations of $x_j$, $y_k$, then the
events $X = x_j$ and $Y = y_k$ are independent; the joint distribution
assumes the form of a multiplication table. In this case we speak of
independent random variables. They occur in particular in connection
with independent trials, for example, if X and Y are the numbers
scored in two throws of a die. Note that the joint distribution of X
and Y determines the distributions of X and Y, but that we cannot
calculate the joint distribution of X and Y from their marginal distributions.
If two variables X and Y have the same distribution, they
may or may not be independent. For example, the two variables X
and Y in Table 1 have the same distribution and are dependent. On
the other hand, if X and Y had the same meaning but referred to two
independent bridge games, the marginal distributions would be the
same but X and Y would be independent, and the joint probability
distribution would assume the form of a multiplication table.
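The two extreme cases just described, independence and functional dependence, can be told apart mechanically from a joint distribution. The following is a minimal sketch; the helper names `marginals`, `independent`, and `conditional` are hypothetical, and a joint distribution is stored as a dictionary from pairs (x, y) to probabilities.

```python
from fractions import Fraction

def marginals(p):
    """Row and column sums of a joint distribution {(x, y): probability}."""
    f, g = {}, {}
    for (x, y), pr in p.items():
        f[x] = f.get(x, 0) + pr
        g[y] = g.get(y, 0) + pr
    return f, g

def independent(p):
    """True if p(x, y) = f(x) g(y) for every combination of values."""
    f, g = marginals(p)
    return all(p.get((x, y), 0) == f[x] * g[y] for x in f for y in g)

def conditional(p, y, x):
    """(1.13): Pr{Y = y | X = x} = p(x, y) / f(x)."""
    f, _ = marginals(p)
    return p.get((x, y), 0) / f[x]

# X, Y = numbers scored in two throws of a die: a multiplication table.
dice = {(x, y): Fraction(1, 36) for x in range(1, 7) for y in range(1, 7)}
# n = 3 tosses of a coin, X = number of heads, Y = n - X: Y is a function of X.
coin = {(x, 3 - x): Fraction(c, 8) for x, c in enumerate((1, 3, 3, 1))}

print(independent(dice), independent(coin))
print(conditional(coin, 1, 2))      # Y = 1 is certain once X = 2 is known
```

For the dice the conditional probabilities coincide with the marginal ones, whereas for the coin example every conditional probability is either 0 or 1.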

All our notions apply also to the case of more than two variables. 
We recapitulate in the formal 

Definition. A random variable X is a function defined on a given
sample space, that is, an assignment of a real number to each sample
point. The system of equations (1.1) defines the (probability) distribution
of X. If a particular number x does not occur among the values assumed
by X, then we write $\Pr\{X = x\} = 0$. If two or more random variables
$X_1, X_2, \ldots, X_n$ are defined on the same sample space, their joint distribution
is given by the system of equations which assigns probabilities to all
combinations $X_1 = x_{j_1}$, $X_2 = x_{j_2}$, etc. The variables $X_1, \ldots, X_n$ are called
mutually independent if for any combination of values $x_{j_1}, x_{j_2}, \ldots, x_{j_n}$

(1.14)    $\Pr\{X_1 = x_{j_1}, X_2 = x_{j_2}, \ldots, X_n = x_{j_n}\} = \Pr\{X_1 = x_{j_1}\} \Pr\{X_2 = x_{j_2}\} \cdots \Pr\{X_n = x_{j_n}\}.$

In Chapter 5, section 4, we have defined the sample space corresponding
to n mutually independent trials. Comparing this definition
to (1.14), we see that if $X_k$ depends only on the outcome of the kth trial,
then the variables $X_1, \ldots, X_n$ are mutually independent. More generally,
if a random variable U depends only on the outcomes of the first k
trials, and another variable V depends only on the outcomes of the
last n - k trials, then U and V are independent. (Cf. problem 25.)

We may conceive of a random variable as a labelling of the points
of the sample space. This procedure is familiar from dice, where the
faces are numbered and we speak of numbers as the possible outcomes
of individual trials. In conventional mathematical terminology we
could say that a random variable X is a mapping of the original sample
space onto a new space whose points are $x_1, x_2, \ldots$. It is therefore
legitimate to talk of a random variable X, assuming the values $x_1, x_2, \ldots$
with probabilities $f(x_1), f(x_2), \ldots$ without further reference to the old
sample space; a new one is formed by the sample points $x_1, x_2, \ldots$.
Specifying a probability distribution is equivalent to specifying a sample
space whose points are real numbers. Speaking of two independent
random variables X and Y with distributions $\{f(x_j)\}$ and $\{g(y_k)\}$ is
equivalent to referring to a sample space whose points are pairs of numbers
$(x_j, y_k)$ to which probabilities are assigned by the rule $\Pr\{(x_j, y_k)\}
= f(x_j) g(y_k)$. Similarly, for the sample space corresponding to a set
of n random variables $(X_1, \ldots, X_n)$ we can take an aggregate of points
$(x_{j_1}, x_{j_2}, \ldots, x_{j_n})$ in the n-dimensional space to which probabilities are
assigned by the joint distribution. The variables are mutually independent
if their joint distribution is given by (1.14).

It is clear that the same distribution can occur in conjunction with
different sample spaces. If we say that the random variable X assumes
the values 0 and 1 with probabilities 1/2, then we refer tacitly to a
sample space consisting of the two points 0 and 1. However, the variable
X might have been defined by stipulating that it equals 0 or 1
according as the tenth tossing of a coin produces heads or tails; in
this case X is defined in a sample space of sequences (HHT...), and
this sample space has $2^{10}$ points.

In principle, it is possible to restrict the theory of probability to 
sample spaces defined in terms of probability distributions of random 
variables. This procedure avoids references to abstract sample spaces, 
and also to terms like “trials” and “outcomes of experiments.” The 
reduction of probability theory to random variables is a short-cut to 
the use of analysis and simplifies the theory in many ways. However, 
it also has the drawback of obscuring the probability background. The 
notion of random variable easily remains vague as “something that 
takes on different values with different probabilities.” But random 
variables are ordinary functions, and this notion is by no means peculiar 
to probability theory. 

Example. (d) Let X be a random variable with possible values
$x_1, x_2, \ldots$ and corresponding probabilities $f(x_1), f(x_2), \ldots$. If it helps
the reader’s imagination, he may always construct a conceptual
experiment leading to X. For example, subdivide a roulette wheel into
arcs $l_1, l_2, \ldots$ whose lengths are as $f(x_1) : f(x_2) : \ldots$. Imagine a gambler
receiving the (positive or negative) amount $x_j$ if the roulette comes to
rest at a point of $l_j$. Then X is the gambler’s gain. In n trials, the gains
are assumed to be n independent variables with the common distribution
$\{f(x_j)\}$. To obtain two variables with a given joint distribution
$\{p(x_j, y_k)\}$ let an arc correspond to each combination $(x_j, y_k)$ and
think of two gamblers receiving the amounts $x_j$ and $y_k$, respectively.

If X, Y, Z, ... are random variables defined on the same sample
space, then any function F(X, Y, Z, ...) is again a random variable.
Its distribution can be obtained from the joint distribution of
X, Y, Z, ... simply by collecting the terms which correspond to combinations
of (X, Y, Z, ...) giving the same value of F(X, Y, Z, ...).

Example. (e) In the example illustrated by Table 1, the sum
S = X + Y represents the total number of aces in two specified bridge
hands. It can assume the values 0, 1, 2, 3, 4, and the corresponding
probabilities can be found in Table 1. Thus the event S = 0
is the same as (X = 0, Y = 0), and its probability is 0.055222. For
reasons of symmetry this is also the probability of the event S = 4;
however, the latter event can occur in five different ways, and using
only Table 1 we get for its probability 0.022473 + 2(0.013734
+ 0.002641) = 0.055223. Similarly, Pr{S = 1} = 2(0.124850) =
0.249700, and this happens to be the same as Pr{S = 3}. Finally
Pr{S = 2} = 0.202881 + 2(0.093637) = 0.390155. The product XY
is another random variable assuming the values 0, 1, 2, 3, 4. The
event XY = 0 occurs if either X or Y vanishes, and its probability is
therefore the sum of all entries in the first row and the first column.
The distribution of XY is given by f(0) = 0.552412, f(1) = 0.202881,
f(2) = 0.194766, f(3) = 0.027468, f(4) = 0.022473.
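The collecting of terms described in this example is easily mechanized. The sketch below enters Table 1 as printed (to six decimals) and tabulates the distribution of any function of X and Y; the names `TABLE` and `distribution_of` are hypothetical.

```python
# Table 1, entered as printed: TABLE[y][x] = Pr{X = x, Y = y}.
TABLE = [
    [0.055222, 0.124850, 0.093637, 0.027467, 0.002641],
    [0.124850, 0.202881, 0.097383, 0.013734, 0.0],
    [0.093637, 0.097383, 0.022473, 0.0, 0.0],
    [0.027467, 0.013734, 0.0, 0.0, 0.0],
    [0.002641, 0.0, 0.0, 0.0, 0.0],
]

def distribution_of(func):
    """Distribution of func(X, Y): collect the terms of the joint
    distribution that give the same value of the function."""
    dist = {}
    for y, row in enumerate(TABLE):
        for x, prob in enumerate(row):
            if prob > 0:
                v = func(x, y)
                dist[v] = dist.get(v, 0.0) + prob
    return dist

dist_sum = distribution_of(lambda x, y: x + y)     # S = X + Y
dist_prod = distribution_of(lambda x, y: x * y)    # the product XY
for v in sorted(dist_sum):
    print("S  =", v, round(dist_sum[v], 6))
for v in sorted(dist_prod):
    print("XY =", v, round(dist_prod[v], 6))
```

Because the table is rounded to six decimals, the computed values for S = 0 and S = 4 differ in the last place, exactly as in the text.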

2. Expectations 

To achieve reasonable simplicity it is often necessary to describe
probability distributions rather summarily by a few “typical values.”
Among these the expectation or mean is by far the most important.
It lends itself best to analytical manipulations, and it is preferred by
statisticians because of a property known as sampling stability. Its
definition follows the customary notion of an average. If in a certain
population $n_k$ families have exactly k children, the total number of
families is $n = n_0 + n_1 + n_2 + \cdots$ and the total number of children
$m = n_1 + 2n_2 + 3n_3 + \cdots$. The average number of children per
family is m/n. The analogy between probabilities and frequencies
suggests the following

Definition. Let X be a random variable assuming the values $x_1,
x_2, \ldots$ with corresponding probabilities $f(x_1), f(x_2), \ldots$. The mean or
expected value of X is defined by

(2.1)    $E(X) = \sum x_k f(x_k)$

provided that the series converges absolutely. In this case we say that
X has a finite expectation. If $\sum |x_k| f(x_k)$ diverges, then we say that X
has no finite expectation.

It goes without saying that the most common random variables have
finite expectations; otherwise the concept would be impractical. However,
variables without finite expectations occur in connection with
important recurrence problems in physics. The terms mean, average,
and mathematical expectation are synonymous. We also speak of the
mean of a distribution instead of referring to the corresponding random
variable. The notation E(X) is generally accepted in mathematics
and statistics. In physics $\bar{X}$, $\langle X \rangle$, $\langle X \rangle_{Av}$ are common substitutes
for E(X).

It should be noted that the definition (2.1) applies also if some
among the numbers $x_k$ are equal; for we can collect the corresponding
terms in the series without changing the sum. This observation
becomes useful in connection with functions such as $X^2$. This function
is a new random variable assuming the values $x_k^2$; in general, the
probability of $X^2 = x_k^2$ is not $f(x_k)$ but $f(x_k) + f(-x_k)$. Nevertheless

(2.2)    $E(X^2) = \sum x_k^2 f(x_k)$

provided that the series converges. More generally we get in the same
way the

Theorem. For any function $\phi(x)$ we have a new random variable $\phi(X)$ with

(2.3)    $E(\phi(X)) = \sum \phi(x_k) f(x_k),$

where the series converges absolutely if, and only if, $E(\phi(X))$ exists. For
any constant a we have $E(aX) = aE(X)$.

If several random variables $X_1, \ldots, X_n$ are defined on the same
sample space, then their sum $X_1 + \cdots + X_n$ is a new random variable.
Its possible values and the corresponding probabilities can be
readily found from the joint distribution of the $X_\nu$ and thus
$E(X_1 + \cdots + X_n)$ can be calculated. A simpler procedure is furnished
by the following important

Theorem. If $X_1, X_2, \ldots, X_n$ are random variables with expectations,
then the expectation of their sum exists and is the sum of their expectations:

(2.4)    $E(X_1 + \cdots + X_n) = E(X_1) + \cdots + E(X_n).$

Proof. It suffices to prove (2.4) for two variables X and Y. Using
the notation (1.3), we can write

(2.5)    $E(X) + E(Y) = \sum_{j,k} x_j\, p(x_j, y_k) + \sum_{j,k} y_k\, p(x_j, y_k),$

the summation extending over all possible values $x_j$, $y_k$ (which need
not be all different). The two series converge absolutely, and their
sum can therefore be rearranged to give $\sum_{j,k} (x_j + y_k)\, p(x_j, y_k)$. However,
this is by definition the expectation of X + Y. This accomplishes
the proof.

Clearly, no corresponding general theorem holds for products; for
example, $E(X^2)$ is generally different from $(E(X))^2$. Thus, if X is the
number scored with a balanced die, E(X) = 7/2, but $E(X^2)$
= (1 + 4 + 9 + 16 + 25 + 36)/6 = 91/6. However, the simple
multiplication rule holds for mutually independent variables.

Theorem. If $X$ and $Y$ are mutually independent random variables with
finite expectations, then their product is a random variable with finite
expectation and

(2.6)    $E(XY) = E(X)E(Y)$.

Proof. To calculate $E(XY)$ we should multiply each possible value
$x_j y_k$ with the corresponding probability. We have already remarked
that the values $x_k$ in the definition (2.1) need not be different. Hence,
with $f$ and $g$ denoting the distributions of $X$ and $Y$,

$E(XY) = \sum_{j,k} x_j y_k f(x_j) g(y_k) = \Big\{ \sum_j x_j f(x_j) \Big\} \Big\{ \sum_k y_k g(y_k) \Big\}$,

the rearrangement being justified since the series converge absolutely.
This proves the theorem. By induction the same multiplication rule
holds for any number of mutually independent random variables.

3. Examples and Applications 

(a) Binomial Distribution. Let $S_n$ be the number of successes in $n$
Bernoulli trials with probability $p$ for success. We know that $S_n$ has
the binomial distribution $\{b(k; n, p)\}$ [cf. Chapter 6, (2.1)]. Hence
$E(S_n) = \sum k\, b(k; n, p) = np \sum b(k-1; n-1, p)$. The last sum
includes all terms of the binomial distribution for $n - 1$ and hence
equals 1. Therefore the mean of the binomial distribution is

(3.1)    $E(S_n) = np$.

The same result could have been obtained without calculation by a
method which is often expedient. Let $X_k$ be the number of successes
scored at the $k$th trial. This random variable assumes only the values 0
and 1 with corresponding probabilities $q$ and $p$. Hence
$E(X_k) = 0 \cdot q + 1 \cdot p = p$, and since

(3.2)    $S_n = X_1 + X_2 + \cdots + X_n$,





we get (3.1) directly from theorem (2.4). [Continuation in examples 
(4.d) and (5.a).] 
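
Both routes to (3.1) can be compared numerically. The sketch below is an illustration only; the helper `binomial_pmf` and the values `n = 12`, `p = 1/3` are assumptions chosen for the example.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(k, n, p):
    """b(k; n, p) = C(n, k) p^k q^(n-k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 12, Fraction(1, 3)

# Direct summation of k * b(k; n, p).
direct = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))

# Addition theorem (2.4): each indicator X_k has expectation p.
via_indicators = sum(p for _ in range(n))

print(direct, via_indicators)
```

The second computation never touches the binomial coefficients, which is the point of the indicator method.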

(b) Poisson Distribution. If $X$ has the Poisson distribution
$p(k; \lambda) = e^{-\lambda} \lambda^k / k!$ [cf. Chapter 6, (5.1)], then
$E(X) = \sum k\, p(k; \lambda) = \lambda \sum p(k-1; \lambda)$. The last series contains all terms of the distribution
and therefore adds to unity. Accordingly, $\lambda$ is the mean of the Poisson
distribution. [Continuation in example (4.c).]

(c) In a sequence of Bernoulli trials let $X$ be the number of trials
up to and including the first success. Then $X$ is a random variable
whose probability distribution $\gamma_\nu = \Pr\{X = \nu\}$ is given by

(3.3)    $\gamma_\nu = q^{\nu-1} p$,    $\nu \ge 1$.

This is the geometric distribution. For the mean we get

(3.4)    $\sum \nu \gamma_\nu = p(1 + 2q + 3q^2 + \cdots)$.

On the right we have the derivative of a geometric series so that the
sum is $p(1 - q)^{-2} = p^{-1}$. Hence the mean of the number of trials up
to and including the first success is $E(X) = 1/p$. The number $X_k$ of
trials following the $(k-1)$st success and up to and including the $k$th
success is a random variable which obviously has the same distribution
as $X$. The sum $X_1 + \cdots + X_n$ is the number of trials up to and including
the $n$th success; its mean is $n/p$, by the addition theorem (2.4).
Note that we have here calculated the mean of a random variable
without knowing its distribution. The latter is the Pascal distribution
to be discussed in Chapter 11, section 3. [Continuation in example
(5.b).]
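
As a rough numerical check of (3.4), one may sum the series up to a large cutoff. This is a sketch under stated assumptions: the function name `geometric_mean`, the cutoff `terms`, and the value `p = 0.2` are illustrative, and the truncated sum only approximates the infinite series.

```python
def geometric_mean(p, terms=5000):
    """Partial sum of sum_v v * q^(v-1) * p, which approaches 1/p."""
    q = 1.0 - p
    return sum(v * q ** (v - 1) * p for v in range(1, terms + 1))

p = 0.2
mean_first_success = geometric_mean(p)     # close to 1/p = 5
mean_nth_success = 3 * mean_first_success  # addition theorem (2.4) with n = 3

print(mean_first_success, mean_nth_success)
```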

(d) A Sampling Problem. A population of $N$ distinct elements is
sampled with replacement. Because of repetitions a random sample
of size $r$ will in general contain fewer than $r$ distinct elements. As
the sample size increases, new elements will enter the sample more and
more rarely. We are interested in the sample size $S_r$ necessary for
the acquisition of $r$ distinct elements. (As a special case, consider the
population of $N = 365$ possible birthdays; here $S_r$ represents the
number of people sampled up to the moment where the sample contains
$r$ different birthdays. A similar interpretation is possible with
random placements of balls into cells. Our problem is of particular
interest to collectors of coupons or other items where the acquisition
can be compared to random sampling.²)

2 G. Polya, Eine Wahrscheinlichkeitsaufgabe zur Kundenwerbung, Zeitschrift für
Angewandte Mathematik und Mechanik, vol. 10 (1930), pp. 96-97. Polya treats a
slightly more general problem with different methods.





The first element enters the sample at the first drawing. The number
of drawings from the second up to and including the drawing at which
a new element enters the sample is a random variable $X_1$; generally,
let $X_r$ be the number of drawings following the selection of the $r$th
element up to and including the selection of the next new element.
Then $S_r = 1 + X_1 + \cdots + X_{r-1}$ is the sample size at the moment
that the $r$th element enters the sample. Once the sample contains $k$
different elements the probability of drawing a new one is at each
drawing $p = (N - k)/N$. The distribution of $X_k$ is therefore given
by (3.3) and $E(X_k) = 1/p = N/(N - k)$. Hence, from the addition
theorem (2.4),

(3.5)    $E(S_r) = N \left\{ \frac{1}{N} + \frac{1}{N-1} + \frac{1}{N-2} + \cdots + \frac{1}{N-r+1} \right\}$.


For $r = N$ we get the expected number of drawings necessary to
exhaust the entire population. For $N = 10$ we have $E(S_{10}) = 29.29\ldots$,
and $E(S_5) = 6.46\ldots$. This means that we can expect
to have covered half the population in about 6 to 7 drawings, whereas
the second half requires some 23 more drawings. A reasonable approximation³
to (3.5) for large $N$ is

(3.6)    $E(S_r) \approx N \log \frac{N}{N - r + 1}$.


In particular, for any fraction $\alpha < 1$ the expected number of drawings
required to obtain a sample containing about the fraction $\alpha$ of the entire
population is, for large $N$, approximately $N \log \frac{1}{1 - \alpha}$; the expected
number of drawings necessary to have all $N$ elements included in the
sample is, approximately, $N \log N$. Note that our results are again
obtained without use of the distribution. For sampling without replacement
the same method is applicable [cf. problem 14; the present
example is continued in section 5, (c); it is treated with different
methods in Chapter 11, problems 5 and 6].
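
The numerical values quoted for $N = 10$ can be reproduced from (3.5). The sketch below is illustrative only: the function names are assumptions, and the two logarithmic bounds are those obtained from the integration argument in the footnote to (3.6).

```python
from fractions import Fraction
from math import log

def expected_sample_size(N, r):
    """E(S_r) = N * (1/N + 1/(N-1) + ... + 1/(N-r+1)), as in (3.5)."""
    return N * sum(Fraction(1, N - k) for k in range(r))

def log_bounds(N, r):
    """Lower and upper logarithmic bounds for E(S_r)."""
    lower = N * log((N + 1) / (N - r + 1))
    upper = N * log((N + 0.5) / (N - r + 0.5))
    return lower, upper

for r in (5, 10):
    exact = float(expected_sample_size(10, r))
    lower, upper = log_bounds(10, r)
    print(r, round(exact, 2), round(lower, 2), round(upper, 2))
```

For a population as small as $N = 10$ the bounds are loose; they tighten relative to $E(S_r)$ as $N$ grows.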

3 If the tangent and rectangle rules for numerical integration are applied to (3.5),
one finds that

$N \log \frac{N + 1/2}{N - r + 1/2} > E(S_r) > N \log \frac{N + 1}{N - r + 1}$.

(e) A bowl contains balls numbered 1 to $N$. Let $X$ be the largest
number drawn in $n$ drawings when random sampling with replacement
is used. The event $X \le k$ means that each of $n$ numbers drawn is less
than or equal to $k$ and therefore $\Pr\{X \le k\} = (k/N)^n$. Hence the
probability distribution of $X$ is given by

$p_k = \Pr\{X = k\} = \Pr\{X \le k\} - \Pr\{X \le k - 1\} = \{k^n - (k-1)^n\} N^{-n}$.

It follows that

(3.7)    $E(X) = \sum_{k=1}^{N} k p_k = N^{-n} \sum_{k=1}^{N} \{k^{n+1} - (k-1)^{n+1} - (k-1)^n\} = N^{-n} \Big\{ N^{n+1} - \sum_{k=1}^{N} (k-1)^n \Big\}$.

For large $N$ the last sum is approximately the area under the curve
$y = x^n$ from $x = 0$ to $x = N$, that is, $N^{n+1}/(n+1)$. It follows that
for large⁴ $N$

(3.8)    $E(X) \approx \frac{n}{n+1} N$.

If a town has $N = 1000$ cars and a sample of $n = 10$ is observed, the
expected number of the highest observed license plate (assuming
randomness) is about 910. The converse problem of estimating the
unknown true number $N$ from the observed maximum in a sample
occurs in routine statistical analysis. (Continuation in problems 6-8.)
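
The license-plate figure can be checked against the exact expectation. This is a minimal sketch; the function name `expected_maximum` is an assumption made for the illustration.

```python
from fractions import Fraction

def expected_maximum(N, n):
    """Exact E(X) = sum_k k * p_k with p_k = (k^n - (k-1)^n) / N^n."""
    return sum(Fraction(k * (k ** n - (k - 1) ** n), N ** n)
               for k in range(1, N + 1))

N, n = 1000, 10
exact = float(expected_maximum(N, n))
approx = n / (n + 1) * N  # approximation (3.8)

print(round(exact, 2), round(approx, 2))
```

The exact value and the approximation (3.8) differ by about half a unit here, and both round to the "about 910" of the text.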

(f) Banach's Match-box Problem. In example (2.h) of Chapter 6,
we found the distribution

(3.9)    $u_r = \binom{2N - r}{N} 2^{-2N+r}$

for the number $X$ of matches left at the moment when the first box is
found empty. We are unable to calculate the expectation $E(X) = \mu$
in a direct way, but the following indirect way is applicable in many
similar cases. Using the fact that the $u_r$ add to unity (which is not
easily verified), we find

(3.10)    $N - \mu = \sum_{r=0}^{N-1} (N - r) u_r = \sum_{r=0}^{N-1} (N - r) \binom{2N - r}{N} \frac{1}{2^{2N-r}}$.


4 A more precise estimate follows from the tangent and trapezoid rules for numerical integration:






By a simple operation on the binomial coefficients the last sum is
transformed into

(3.11)    $\sum_{r=0}^{N-1} (2N - r) \binom{2N - r - 1}{N} \frac{1}{2^{2N-r}} = \frac{2N+1}{2} \sum_{r=0}^{N-1} u_{r+1} - \frac{1}{2} \sum_{r=0}^{N-1} (r+1) u_{r+1}$.

The last sum is identical with the sum defining $\mu = E(X)$. In the
first sum all $u_r$ except $u_0$ occur, and hence the terms add to $1 - u_0$.
Thus from (3.10) and (3.11)

(3.12)    $N - \mu = \frac{2N+1}{2} (1 - u_0) - \frac{\mu}{2}$

or

(3.13)    $\mu = (2N+1) u_0 - 1 = \frac{2N+1}{2^{2N}} \binom{2N}{N} - 1$.

Using Stirling's formula, we find

(3.14)    $\mu \approx 2 \sqrt{N/\pi} - 1$.

In particular, in the distribution of Chapter 6, Table 2, we had $N = 50$.
For it $\mu = 7.04\ldots$
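
The indirect computation can be compared with a brute-force summation. The sketch below assumes the standard form $u_r = \binom{2N-r}{N} 2^{-(2N-r)}$ for the match-box distribution; the function name and variable names are illustrative.

```python
from fractions import Fraction
from math import comb, pi, sqrt

def matchbox_distribution(N):
    """u_r = C(2N - r, N) / 2^(2N - r) for r = 0, ..., N (assumed form)."""
    return [Fraction(comb(2 * N - r, N), 2 ** (2 * N - r)) for r in range(N + 1)]

N = 50
u = matchbox_distribution(N)

direct_mean = sum(r * u_r for r, u_r in enumerate(u))  # sum of r * u_r
closed_form = (2 * N + 1) * u[0] - 1                   # mu = (2N+1) u_0 - 1
stirling = 2 * sqrt(N / pi) - 1                        # Stirling approximation

print(float(direct_mean), float(closed_form), stirling)
```

With exact fractions the direct sum and the closed form agree identically, while the Stirling value is only an approximation for finite $N$.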


4. The Variance

Let $X$ be a random variable with distribution $\{f(x_j)\}$, and let $r > 0$
be an integer. If the expectation of the random variable $X^r$, that is,

(4.1)    $E(X^r) = \sum x_j^r f(x_j)$,

exists, then it is called the $r$th moment of $X$ about the origin. If the series
does not converge absolutely, we say that the $r$th moment does not
exist. Since $|X|^{r-1} \le |X|^r + 1$, it follows that whenever the $r$th
moment exists so does the $(r-1)$st, and hence all preceding moments.

Moments play an important role in the general theory, but in the
present volume we shall use only the second moment. If it exists, so
does the mean

(4.2)    $\mu = E(X)$.

It is then natural to introduce instead of the random variable its
deviation from the mean, $X - \mu$. Since $(x - \mu)^2 \le 2(x^2 + \mu^2)$ we see





that the second moment of $X - \mu$ exists whenever $E(X^2)$ exists. We
find

(4.3)    $E((X - \mu)^2) = \sum_j (x_j^2 - 2\mu x_j + \mu^2) f(x_j)$.

Splitting the right side into three individual sums, we find it equal to
$E(X^2) - 2\mu E(X) + \mu^2 = E(X^2) - \mu^2$.

Definition. Let $X$ be a random variable with second moment $E(X^2)$
and let $\mu = E(X)$ be its mean. We define a number called the variance of $X$
by

(4.4)    $\mathrm{Var}(X) = E((X - \mu)^2) = E(X^2) - \mu^2$.

Its positive square root (or zero) is called the standard deviation of $X$.

For simplicity we often speak of the variance of a distribution 
without mentioning the random variable. “Dispersion” is a synonym 
for the now generally accepted term “variance.” 

Examples. (a) If $X$ assumes the values $\pm c$, each with probability
1/2, then $\mathrm{Var}(X) = c^2$.

(b) If $X$ is the number of points scored with a symmetric die, then
$\mathrm{Var}(X) = \frac{1}{6}(1^2 + 2^2 + \cdots + 6^2) - (7/2)^2 = 35/12$.

(c) For the Poisson distribution $p(k; \lambda)$ the mean is $\lambda$ [cf. 3, (b)]
and hence the variance is
$\sum k^2 p(k; \lambda) - \lambda^2 = \lambda \sum k\, p(k-1; \lambda) - \lambda^2 = \lambda \sum (k-1)\, p(k-1; \lambda) + \lambda \sum p(k-1; \lambda) - \lambda^2 = \lambda^2 + \lambda - \lambda^2 = \lambda$.
In this case mean and variance are equal.

(d) For the binomial distribution [cf. 3, (a)] a similar computation
shows that the variance is

$\sum k^2 b(k; n, p) - (np)^2 = np \sum k\, b(k-1; n-1, p) - (np)^2 = np\{(n-1)p + 1\} - (np)^2 = npq$.
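
Examples (a), (b), and (d) can be verified by applying (4.4) to a finite distribution. The following sketch is illustrative; the helper `variance` and the parameter values are assumptions made for the example.

```python
from fractions import Fraction
from math import comb

def variance(dist):
    """Var(X) = E(X^2) - mu^2 for a finite distribution {value: probability}."""
    mu = sum(x * p for x, p in dist.items())
    return sum(x * x * p for x, p in dist.items()) - mu * mu

# (a) values +c and -c with probability 1/2 each, here c = 3
plus_minus = {-3: Fraction(1, 2), 3: Fraction(1, 2)}

# (b) a symmetric die
die = {x: Fraction(1, 6) for x in range(1, 7)}

# (d) a binomial distribution with n = 10, p = 1/4
n, p = 10, Fraction(1, 4)
binomial = {k: comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)}

print(variance(plus_minus), variance(die), variance(binomial))
```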

The usefulness of the notion of variance will appear only gradually,
in particular, in connection with limit theorems (Chapter 10). Here
we observe that the variance is a rough measure of spread. In fact, if
$\mathrm{Var}(X) = \sum (x_j - \mu)^2 f(x_j)$ is small, then each term in the sum is
small. A value $x_j$ for which $|x_j - \mu|$ is large must therefore have
a small probability $f(x_j)$. In other words, in case of small variance
large deviations of $X$ from the mean $\mu$ are improbable. Conversely, a
large variance indicates that not all possible values of $X$ lie near the
mean.





Some readers may be helped by the following interpretation in mechanics.
Suppose that a unit mass is distributed on the $x$-axis so that the mass $f(x_j)$ is
concentrated at the point $x_j$. Then the mean $\mu$ is the abscissa of the center of gravity,
and the variance is the moment of inertia. Clearly different mass distributions may
have the same center of gravity and the same moment of inertia, but it is well
known that the most important mechanical properties can be described in terms
of these two quantities.

If $X$ represents a measurable quantity like length or temperature,
then its numerical values depend on the origin and the unit of measurement.
A change of the latter means passing from $X$ to a new variable
$aX + b$, where $a$ and $b$ are constants. Clearly $\mathrm{Var}(X + b) = \mathrm{Var}(X)$,
and hence

(4.5)    $\mathrm{Var}(aX + b) = a^2\, \mathrm{Var}(X)$.

The choice of the origin and unit of measurement is to a large degree
arbitrary, and often it is most convenient to take the mean as origin
and the standard deviation as unit. We have done so in Chapter 7,
when we introduced the normalized number of successes
$S_n^* = (S_n - np)/(npq)^{1/2}$. In general, if $X$ has mean $\mu$ and variance
$\sigma^2$ ($\sigma > 0$), then $X - \mu$ has mean zero and variance $\sigma^2$, and hence the
variable

(4.6)    $X^* = \frac{X - \mu}{\sigma}$

has mean 0 and variance 1. It is called the normalized variable corresponding
to $X$. In the physicist's language, the passage from $X$ to $X^*$
would be interpreted as the introduction of dimensionless quantities.

5. Covariance; Variance of a Sum 

Let $X$ and $Y$ be two random variables on the same sample space.
Then $X + Y$ and $XY$ are again random variables, and their distributions
can be obtained by a simple rearrangement of the joint distribution
of $X$ and $Y$. Our aim now is to calculate $\mathrm{Var}(X + Y)$. For that
purpose we introduce the notion of covariance, which will be analyzed
in greater detail in section 8. If the joint distribution of $X$ and $Y$ is
$\{p(x_j, y_k)\}$, then the expectation of $XY$ is given by

(5.1)    $E(XY) = \sum x_j y_k\, p(x_j, y_k)$,

provided, of course, that the series converges absolutely. Now
$|x_j y_k| \le (x_j^2 + y_k^2)/2$ and therefore $E(XY)$ certainly exists if $E(X^2)$
and $E(Y^2)$ exist. In this case there exist also the expectations

(5.2)    $\mu_x = E(X)$,    $\mu_y = E(Y)$,





and the variables $X - \mu_x$ and $Y - \mu_y$ have zero means. For their
product we have from the addition rule of section 2

(5.3)    $E((X - \mu_x)(Y - \mu_y)) = E(XY) - \mu_x E(Y) - \mu_y E(X) + \mu_x \mu_y = E(XY) - \mu_x \mu_y$.

Definition. The covariance of $X$ and $Y$ is defined by

(5.4)    $\mathrm{Cov}(X, Y) = E((X - \mu_x)(Y - \mu_y)) = E(XY) - \mu_x \mu_y$.

This definition is meaningful whenever $X$ and $Y$ have finite variances.

We know from section 2 that for independent variables
$E(XY) = E(X)E(Y)$. Hence from (5.4) we have

Theorem 1. If $X$ and $Y$ are independent, then $\mathrm{Cov}(X, Y) = 0$. (The
converse is not true; cf. section 8.)

We now proceed to prove the fundamental 

Theorem 2. If $X_1, \ldots, X_n$ are random variables with finite variances
$\sigma_1^2, \ldots, \sigma_n^2$, and $S_n = X_1 + \cdots + X_n$, then

(5.5)    $\mathrm{Var}(S_n) = \sum_{k=1}^{n} \sigma_k^2 + 2 \sum_{j,k} \mathrm{Cov}(X_j, X_k)$,

the last sum extending over each of the $\binom{n}{2}$ pairs $(X_j, X_k)$ with $j < k$ once
and only once.

In particular, if the $X_j$ are mutually independent, then the addition rule

(5.6)    $\mathrm{Var}(S_n) = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_n^2$

holds.

Proof. Put $\mu_k = E(X_k)$ and $m_n = \mu_1 + \cdots + \mu_n = E(S_n)$. Then
$S_n - m_n = \sum (X_k - \mu_k)$ and

(5.7)    $(S_n - m_n)^2 = \sum (X_k - \mu_k)^2 + 2 \sum (X_j - \mu_j)(X_k - \mu_k)$.

Taking expectations and applying the addition rule, we get (5.5).
Equation (5.6) follows from the preceding theorem.
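
Formula (5.5) can be illustrated on two dependent variables. In the sketch below, which is not from the text, $X_1$ is the number shown by the first of two dice and $X_2$ is the larger of the two numbers; the helpers `E` and `cov` are illustrative names.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # 36 equally likely points
prob = Fraction(1, len(outcomes))

def E(g):
    """Expectation of g(a, b) over the two-dice sample space."""
    return sum(g(a, b) * prob for a, b in outcomes)

def cov(U, V):
    """Cov(U, V) = E(UV) - E(U)E(V); cov(U, U) is the variance of U."""
    return E(lambda a, b: U(a, b) * V(a, b)) - E(U) * E(V)

X1 = lambda a, b: a          # first die
X2 = lambda a, b: max(a, b)  # larger of the two dice (depends on X1)
S = lambda a, b: X1(a, b) + X2(a, b)

var_sum = cov(S, S)
rhs = cov(X1, X1) + cov(X2, X2) + 2 * cov(X1, X2)  # right side of (5.5)

print(var_sum, rhs, cov(X1, X2))
```

Here the covariance term is positive, so the variance of the sum exceeds the sum of the variances.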

Examples. (a) Binomial Distribution $\{b(k; n, p)\}$. In example
(3.a), the variables $X_k$ are mutually independent. We have
$E(X_k^2) = 0^2 \cdot q + 1^2 \cdot p = p$, and $E(X_k) = p$. Hence $\sigma_k^2 = p - p^2 = pq$, and
from (5.6) we see that the variance of the binomial distribution is $npq$.
The same result was derived by direct computation in example (4.d).




(b) In example (3.c) the variables $X_k$ are again independent and
have the common distribution $\gamma_\nu = q^{\nu-1} p$ ($\nu = 1, 2, \ldots$). This is a
geometric distribution. For its second moment we find

(5.8)    $\sum \nu^2 q^{\nu-1} p = qp \sum \nu(\nu - 1) q^{\nu-2} + p \sum \nu q^{\nu-1} = qp \frac{d^2}{dq^2} (1 - q)^{-1} + p \frac{d}{dq} (1 - q)^{-1} = \frac{2qp}{(1 - q)^3} + \frac{p}{(1 - q)^2} = \frac{2q}{p^2} + \frac{1}{p}$.

Since the mean was found to be $1/p$, we have
$\sigma_k^2 = \frac{2q}{p^2} + \frac{1}{p} - \frac{1}{p^2} = \frac{q}{p^2}$, and $\mathrm{Var}(S_n) = \frac{nq}{p^2}$.
This is the variance of the number of trials
up to and including the $n$th success (or the variance of the Pascal distribution;
cf. Chapter 11, section 3). Note that once more we have
calculated a variance without knowing the distribution.

(c) In the collector's problem (3.d) the variables $X_k$ are still
independent, but they no longer have a common distribution. We
know that $X_k$ is the number of trials up to and including the first success
in a sequence of Bernoulli trials with $p = (N - k)/N$. Hence
from the preceding example $E(X_k) = N/(N - k)$, and
$\mathrm{Var}(X_k) = kN/(N - k)^2$. Thus

(5.9)    $\mathrm{Var}(S_r) = N \left\{ \frac{1}{(N-1)^2} + \frac{2}{(N-2)^2} + \cdots + \frac{r-1}{(N-r+1)^2} \right\}$.
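
The variance in the collector's problem can be assembled term by term from the geometric variance $q/p^2$ and compared with the closed expression. This is a sketch with illustrative function names, using $N = 10$ as in the earlier numerical example.

```python
from fractions import Fraction

def collector_variance(N, r):
    """Sum of Var(X_k) = q / p^2 with p = (N - k)/N, for k = 1, ..., r - 1."""
    total = Fraction(0)
    for k in range(1, r):
        p = Fraction(N - k, N)
        total += (1 - p) / p ** 2
    return total

def collector_variance_closed(N, r):
    """N * (1/(N-1)^2 + 2/(N-2)^2 + ... + (r-1)/(N-r+1)^2)."""
    return N * sum(Fraction(k, (N - k) ** 2) for k in range(1, r))

print(float(collector_variance(10, 5)), float(collector_variance(10, 10)))
```

Most of the variance for $r = N$ comes from the last few elements, whose waiting times are long and highly variable.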


(d) Card Matching. A deck of $n$ numbered cards is put into random
order so that all $n!$ arrangements have equal probabilities. The
number of matches (cards in their natural place) is a random variable
$S_n$ which assumes the values $0, 1, \ldots, n$. Its probability distribution
was derived in Chapter 4. From it the mean and variance could be
obtained, but the following way is simpler and more instructive.

Define a random variable $X_k$ which is either 1 or 0, according as
card number $k$ is or is not at the $k$th place. Then $S_n = X_1 + \cdots + X_n$.
Now each card has probability $1/n$ to appear at the $k$th place. Hence
$\Pr\{X_k = 1\} = 1/n$ and $\Pr\{X_k = 0\} = (n - 1)/n$. Therefore
$E(X_k) = 1/n$, and it follows that $E(S_n) = 1$: the average is one match per
deck. To calculate the variance we first calculate the variance $\sigma_k^2$
of $X_k$:

(5.10)    $\sigma_k^2 = \frac{1}{n} - \left( \frac{1}{n} \right)^2 = \frac{n-1}{n^2}$.





Next we calculate $E(X_j X_k)$. The product $X_j X_k$ is 0 or 1; the latter
is true if both card number $j$ and card number $k$ are at their proper
places, and the probability for that is $1/n(n - 1)$. Hence

(5.11)    $E(X_j X_k) = \frac{1}{n(n-1)}$,    $\mathrm{Cov}(X_j, X_k) = \frac{1}{n(n-1)} - \frac{1}{n^2} = \frac{1}{n^2(n-1)}$.

Thus finally

(5.12)    $\mathrm{Var}(S_n) = n \frac{n-1}{n^2} + 2 \binom{n}{2} \frac{1}{n^2(n-1)} = 1$.


We see that for the number of matches both mean and variance are
equal to one. This result may be applied to the problem of card guessing
discussed in Chapter 4, section 4. There we considered three methods
of guessing, one of which corresponds to card matching. The second
can be described as a sequence of $n$ Bernoulli trials with probability
$p = 1/n$, in which case the expected number of correct guesses is
$np = 1$ and the variance $npq = (n - 1)/n$. The expected numbers
are the same in both cases, but the larger variance with the first method
indicates greater chance fluctuations about the mean and thus promises
a slightly more exciting game. (With more complicated decks of
cards the difference between the two variances is somewhat larger but
never really big.) With the last mode of guessing the subject keeps
calling the same card; the number of correct guesses is necessarily
one, and chance fluctuations are completely eliminated (variance 0).
We see that the strategy of calling cannot influence the expected
number of correct guesses, but has some influence on the magnitude
of chance fluctuations.
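
For small decks the result can be confirmed by listing all $n!$ arrangements. The following sketch is an illustration; `match_moments` is a hypothetical helper, and full enumeration is practical only for small $n$.

```python
from fractions import Fraction
from itertools import permutations

def match_moments(n):
    """Mean and variance of the number of matches over all n! arrangements."""
    counts = [sum(1 for place, card in enumerate(perm) if place == card)
              for perm in permutations(range(n))]
    total = len(counts)
    mean = Fraction(sum(counts), total)
    var = Fraction(sum(c * c for c in counts), total) - mean ** 2
    return mean, var

for n in (2, 3, 5, 6):
    print(n, match_moments(n))
```

Mean and variance both come out as 1 for every $n \ge 2$ tried, in agreement with the covariance computation.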

(e) Sampling without Replacement. Suppose that a population consists
of $b$ black and $g$ green elements, and that a random sample of size
$r$ is taken (without possible repetitions). The number $S_r$ of black
elements in the sample is a random variable with the hypergeometric
distribution (Chapter 2, section 5)

(5.13)    $\Pr\{S_r = k\} = \binom{b}{k} \binom{g}{r-k} \Big/ \binom{b+g}{r}$.

The mean and variance can be obtained by direct computation, but
the following method is preferable. Define the random variable $X_k$ to
assume the values 1 or 0 according as the $k$th element in the sample is
or is not black ($k \le r$). For reasons of symmetry the probability that
$X_k = 1$ is $b/(b + g)$, and hence


(5.14)    $E(X_k) = \frac{b}{b+g}$,    $\mathrm{Var}(X_k) = \frac{bg}{(b+g)^2}$.

Next, if $j \ne k$, then $X_j X_k = 1$ if the $j$th and $k$th elements of the sample
are black, and otherwise $X_j X_k = 0$. The probability of $X_j X_k = 1$ is
$b(b - 1)/(b + g)(b + g - 1)$, and therefore

(5.15)    $E(X_j X_k) = \frac{b(b-1)}{(b+g)(b+g-1)}$,    $\mathrm{Cov}(X_j, X_k) = \frac{-bg}{(b+g)^2 (b+g-1)}$.

Thus,

(5.16)    $E(S_r) = \frac{rb}{b+g}$,    $\mathrm{Var}(S_r) = \frac{rbg}{(b+g)^2} \left\{ 1 - \frac{r-1}{b+g-1} \right\}$.

In sampling with replacement we would have the same mean, but the
variance would be slightly larger, namely, $rbg/(b + g)^2$ [cf. example (a)].
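
The mean and variance obtained by the indicator method can be compared with a direct summation over the hypergeometric distribution. The sketch below is illustrative; the function names and the sample values are assumptions.

```python
from fractions import Fraction
from math import comb

def hypergeometric_moments(b, g, r):
    """Mean and variance of the number of black elements, by direct summation."""
    total = comb(b + g, r)
    pmf = {k: Fraction(comb(b, k) * comb(g, r - k), total)
           for k in range(min(b, r) + 1)}
    mean = sum(k * p for k, p in pmf.items())
    var = sum(k * k * p for k, p in pmf.items()) - mean ** 2
    return mean, var

def formula_moments(b, g, r):
    """Mean rb/(b+g) and variance rbg/(b+g)^2 * (1 - (r-1)/(b+g-1))."""
    n = b + g
    mean = Fraction(r * b, n)
    var = Fraction(r * b * g, n * n) * (1 - Fraction(r - 1, n - 1))
    return mean, var

print(hypergeometric_moments(5, 7, 4), formula_moments(5, 7, 4))
```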


6. Chebyshev's Inequality⁵

It has been pointed out that a small variance indicates that large 
deviations from the mean are improbable. This statement is made 
more precise by Chebyshev’s inequality, which is an exceedingly 
useful and handy tool. 

Theorem. Let $X$ be a random variable with mean $\mu = E(X)$ and
variance $\sigma^2 = \mathrm{Var}(X)$. Then for any $t > 0$

(6.1)    $\Pr\{|X - \mu| \ge t\} \le \frac{\sigma^2}{t^2}$.

Proof. The variance is defined in (4.3) by a series with positive
terms. Deleting all terms for which $|x_j - \mu| < t$ cannot increase the
value of the series, and hence

(6.2)    $\sigma^2 \ge {\sum}^* (x_j - \mu)^2 f(x_j)$

where the star indicates that the summation extends only over those $j$
for which $|x_j - \mu| \ge t$. It is then clear that

(6.3)    ${\sum}^* (x_j - \mu)^2 f(x_j) \ge t^2 {\sum}^* f(x_j) = t^2 \Pr\{|X - \mu| \ge t\}$

which proves the theorem.

5 P. L. Chebyshev (1821-1894).

Chebyshev’s inequality must be regarded as a theoretical tool rather 
than a practical method of estimation. Its importance is due to its 
universality, but no statement of great generality can be expected to 
yield sharp results in individual cases. 

Examples. (a) If $X$ is the number scored in a throw of a true die,
then [cf. example (4.b)], $\mu = 7/2$, $\sigma^2 = 35/12$. The maximum deviation
of $X$ from $\mu$ is $2.5 \approx 3\sigma/2$. The probability of greater deviations
is zero, while Chebyshev's inequality only asserts that this probability
is smaller than 0.47.

(b) For the geometric distribution $q_k = 2^{-k}$ ($k = 1, 2, \ldots$) we have
[cf. example (5.b)] $\mu = 2$, $\sigma^2 = 2$. Here
$\Pr\{|X - 2| > 2\} = 2^{-5} + 2^{-6} + 2^{-7} + \cdots = 2^{-4}$, while Chebyshev's inequality gives an
upper bound of 1/2.

(c) For the binomial distribution $\{b(k; n, p)\}$ we have [cf. example
(5.a)] $\mu = np$, $\sigma^2 = npq$. For large $n$ we know that

(6.4)    $\Pr\{|S_n - np| > x(npq)^{1/2}\} \approx 1 - \Phi(x) + \Phi(-x)$.

Chebyshev's inequality states only that the left side is less than $1/x^2$;
this is obviously a much poorer estimate than (6.4).
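
The gap between the true tail probability and the Chebyshev bound in example (a) can be computed directly. This is a minimal sketch; the helper `tail_and_bound` is an illustrative name, and it uses the strict inequality $|X - \mu| > t$ to match the wording "greater deviations" in the example.

```python
from fractions import Fraction

def tail_and_bound(dist, t):
    """Return (Pr{|X - mu| > t}, sigma^2 / t^2) for a finite distribution."""
    mu = sum(x * p for x, p in dist.items())
    var = sum((x - mu) ** 2 * p for x, p in dist.items())
    strict_tail = sum(p for x, p in dist.items() if abs(x - mu) > t)
    return strict_tail, var / t ** 2

die = {x: Fraction(1, 6) for x in range(1, 7)}
tail, bound = tail_and_bound(die, Fraction(5, 2))

print(tail, bound, float(bound))
```

The bound $7/15 \approx 0.47$ is far from the true value 0, which illustrates why the inequality serves as a theoretical tool rather than as an estimate.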

* 7. Kolmogorov’s Inequality 6 

As an example of more refined methods we prove: 

Let $X_1, \ldots, X_n$ be mutually independent variables with expectations
$\mu_k = E(X_k)$ and variances $\sigma_k^2$. Put

(7.1)    $S_k = X_1 + \cdots + X_k$

and

(7.2)    $m_k = E(S_k) = \mu_1 + \cdots + \mu_k$,
         $s_k^2 = \mathrm{Var}(S_k) = \sigma_1^2 + \cdots + \sigma_k^2$.

For every $t > 0$ the probability of the simultaneous realization of the $n$
inequalities

(7.3)    $|S_k - m_k| < t s_n$,    $k = 1, 2, \ldots, n$

is at least $1 - t^{-2}$.

* Starred sections treat special topics and may be omitted at first reading.

6 Über die Summen zufälliger Grössen, Mathematische Annalen, vol. 99 (1928),
pp. 309-319, and vol. 102 (1929), pp. 484-488.

For $n = 1$ this theorem reduces to Chebyshev's inequality. For
$n > 1$ Chebyshev's inequality gives the same bound for the probability
of the single relation $|S_n - m_n| < t s_n$, so that Kolmogorov's inequality
is considerably stronger.

Proof. We want to estimate the probability $x$ that one of the
inequalities (7.3) does not hold. The theorem asserts that $x \le t^{-2}$.
Define $n$ random variables $Y_k$ as follows: $Y_\nu = 1$ if

(7.4)    $|S_\nu - m_\nu| \ge t s_n$

but

(7.5)    $|S_k - m_k| < t s_n$    for $k = 1, 2, \ldots, \nu - 1$;

$Y_\nu = 0$ for all other sample points. In words, $Y_\nu$ equals 1 at those
points in which the $\nu$th of the inequalities (7.3) is the first to be violated.
Then at any particular sample point at most one among the $Y_k$ is 1, and
the sum $Y_1 + Y_2 + \cdots + Y_n$ can assume only the values 0 or 1; it is 1
if and only if one of the inequalities (7.3) is violated, and therefore

(7.6)    $x = \Pr\{Y_1 + \cdots + Y_n = 1\}$.

Since $Y_1 + \cdots + Y_n$ is 0 or 1, we have $\sum Y_k \le 1$. Multiplying by
$(S_n - m_n)^2$ and taking expectations, we get

(7.7)    $\sum_{k=1}^{n} E(Y_k (S_n - m_n)^2) \le s_n^2$.

For an evaluation of the terms on the left we put

(7.8)    $U_k = (S_n - m_n) - (S_k - m_k) = \sum_{\nu=k+1}^{n} (X_\nu - \mu_\nu)$.

Then

(7.9)    $E(Y_k (S_n - m_n)^2) = E(Y_k (S_k - m_k)^2) + 2E(Y_k U_k (S_k - m_k)) + E(Y_k U_k^2)$.

However, $U_k$ depends only on $X_{k+1}, \ldots, X_n$, while $Y_k$ and $S_k$
depend only on $X_1, \ldots, X_k$. Hence $U_k$ is independent of $Y_k (S_k - m_k)$
and therefore $E(Y_k U_k (S_k - m_k)) = E(Y_k (S_k - m_k)) E(U_k) = 0$, since
$E(U_k) = 0$. Thus from (7.9)

(7.10)    $E(Y_k (S_n - m_n)^2) \ge E(Y_k (S_k - m_k)^2)$.

But $Y_k \ne 0$ only if $|S_k - m_k| \ge t s_n$, so that $Y_k (S_k - m_k)^2 \ge t^2 s_n^2 Y_k$.




Hence, combining (7.7) and (7.10), we get

(7.11)    $s_n^2 \ge t^2 s_n^2 E(Y_1 + \cdots + Y_n)$.

Since $Y_1 + \cdots + Y_n$ equals either 0 or 1, the expectation to the right
equals the probability $x$ defined in (7.6). Thus $x t^2 \le 1$ as asserted.
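
Kolmogorov's inequality can be checked by complete enumeration in a simple case. The sketch below assumes independent variables taking the values $+1$ and $-1$ with probability 1/2 each, so that $m_k = 0$ and $s_n^2 = n$; this choice of variables and the function name are illustrative and not taken from the text.

```python
from itertools import accumulate, product
from math import sqrt

def max_deviation_probability(n, t):
    """Pr{ |S_k| >= t * s_n for at least one k <= n } for fair +1/-1 steps."""
    threshold = t * sqrt(n)  # s_n = sqrt(n) because each step has variance 1
    hits = 0
    for steps in product((-1, 1), repeat=n):
        if any(abs(s) >= threshold for s in accumulate(steps)):
            hits += 1
    return hits / 2 ** n

for t in (1.5, 2.0):
    print(t, max_deviation_probability(10, t), 1 / t ** 2)
```

In each case the probability that some partial sum leaves the band stays below $t^{-2}$, as the theorem requires.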

* 8. The Correlation Coefficient 

Let $X$ and $Y$ be any two random variables with means $\mu_x$ and $\mu_y$
and positive variances $\sigma_x^2$ and $\sigma_y^2$. We introduce the corresponding
normalized variables $X^*$ and $Y^*$ defined by (4.6). Their covariance is
called the correlation coefficient of $X$, $Y$ and is denoted by $\rho(X, Y)$. Thus,
using (5.4),

(8.1)    $\rho(X, Y) = \mathrm{Cov}(X^*, Y^*) = \frac{\mathrm{Cov}(X, Y)}{\sigma_x \sigma_y}$.

Clearly this correlation coefficient is independent of the origins and
units of measurements, that is, for any constants $a_1, a_2, b_1, b_2$, with
$a_1 > 0$, $a_2 > 0$, we have $\rho(a_1 X + b_1, a_2 Y + b_2) = \rho(X, Y)$.

The use of the correlation coefficient amounts to a fancy way of
writing the covariance.⁷ Unfortunately, the term correlation is suggestive
of implications which are not inherent in it. We know from section
5 that $\rho(X, Y) = 0$ whenever $X$ and $Y$ are independent. It is important
to realize that the converse is not true. In fact, the correlation coefficient
$\rho(X, Y)$ can vanish even if $Y$ is a function of $X$.

Examples. (a) Let $X$ assume the values $\pm 1$, $\pm 2$ each with probability
1/4. Let $Y = X^2$. The joint distribution is given by
$p(-1, 1) = p(1, 1) = p(2, 4) = p(-2, 4) = 1/4$. For reasons of symmetry
$\rho(X, Y) = 0$ even though we have a direct functional dependence of
$Y$ on $X$.

(b) Let $U$ and $V$ be independent variables with the same distribution,
and let $X = U + V$, $Y = U - V$. Then $E(XY) = E(U^2) - E(V^2) = 0$
and $E(Y) = 0$. Hence $\mathrm{Cov}(X, Y) = 0$ and therefore also
$\rho(X, Y) = 0$. For example, $X$ and $Y$ may be the sum and difference
of points on two dice. Then $X$ and $Y$ are either both odd or both even,
and therefore dependent.
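
Both examples can be verified by computing the covariance over the equally likely sample points. This is a sketch with illustrative names; `covariance` assumes every listed point has the same probability.

```python
from fractions import Fraction
from itertools import product

def covariance(points):
    """Cov(X, Y) for a list of equally likely sample points (x, y)."""
    w = Fraction(1, len(points))
    ex = sum(x for x, _ in points) * w
    ey = sum(y for _, y in points) * w
    exy = sum(x * y for x, y in points) * w
    return exy - ex * ey

# Example (a): Y = X^2 with X uniform on {-2, -1, 1, 2}.
square_points = [(x, x * x) for x in (-2, -1, 1, 2)]

# Example (b): sum and difference of the points on two dice.
dice_points = [(u + v, u - v) for u, v in product(range(1, 7), repeat=2)]

print(covariance(square_points), covariance(dice_points))

# Dependence in (b): Pr{X = 2, Y = 0} differs from Pr{X = 2} * Pr{Y = 0}.
n = len(dice_points)
joint = Fraction(sum(1 for x, y in dice_points if x == 2 and y == 0), n)
px = Fraction(sum(1 for x, _ in dice_points if x == 2), n)
py = Fraction(sum(1 for _, y in dice_points if y == 0), n)
print(joint, px * py)
```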

It follows that the correlation coefficient is by no means a general
measure of dependence between $X$ and $Y$. However, $\rho(X, Y)$ is connected
with the linear dependence of $X$ and $Y$.


7 The physicist would define the correlation coefficient as “dimensionless covariance.”




Theorem. We have always $|\rho(X, Y)| \le 1$, and $\rho(X, Y) = \pm 1$ only if
$Y = aX + b$, where $a$ and $b$ are constants.

Proof. Let $X^*$ and $Y^*$ be the normalized variables. Then

(8.2)    $\mathrm{Var}(X^* \pm Y^*) = \mathrm{Var}(X^*) \pm 2\, \mathrm{Cov}(X^*, Y^*) + \mathrm{Var}(Y^*) = 2(1 \pm \rho(X, Y))$.

The left side cannot be negative; hence $|\rho(X, Y)| \le 1$. For $\rho(X, Y) = 1$
it is necessary that $\mathrm{Var}(X^* - Y^*) = 0$ which means that the
variable $X^* - Y^*$ assumes only one value. In this case $X^* - Y^* = $ const.,
and hence $Y = aX + $ const. with $a = \sigma_y/\sigma_x$. A similar
argument applies to the case $\rho(X, Y) = -1$.

9. Problems for Solution 

1. In 5 tosses of a coin let $X$, $Y$, $Z$ be, respectively, the number of heads, the
number of head runs, the length of the largest head run. Tabulate the 32 sample
points together with the corresponding values of $X$, $Y$, and $Z$. By simple counting
derive the joint distributions of the pairs $(X, Y)$, $(X, Z)$, $(Y, Z)$ and the distributions
of $X + Y$ and $XY$. Find the means, variances, covariances of the variables.

2. Birthdays. For a group of $n$ people find the expected number of days of the
year which are birthdays of exactly $k$ people. (Assume 365 days and that all
arrangements are equally probable.)

3. Continuation. Find the expected number of multiple birthdays. How large
should $n$ be to make this expectation exceed 1?

4. A man wants to open his door and has $n$ keys. For reasons which can be
only surmised he tries them independently and at random. Find the mean and
variance of the number of trials (a) if unsuccessful keys are not eliminated from
further selections; (b) if they are. (Assume that only one key fits the door.)

5. Find the covariance of the number of ones and sixes in $n$ throws of a die.

6. In example (3.e) find $E(X^2)$ and hence an asymptotic expression for the
variance as $N \to \infty$ (with $n$ fixed).

7. Continuation. Find the joint distribution of the largest and smallest
observation.

8. Continuation. Find the conditional probability that the first two observations
are $j$ and $k$, given that $X = r$.

9. In a sequence of Bernoulli trials let $X$ be the length of the run (of either
successes or failures) started by the first trial. Find the distribution of $X$, $E(X)$,
$\mathrm{Var}(X)$.

10. Continuation. Let $Y$ be the length of the second run. Find the distribution
of $Y$, $E(Y)$, $\mathrm{Var}(Y)$, and the joint distribution of $X$, $Y$.

11. Double hypergeometric distribution. A population of $n$ elements contains,
among others, $n_1$ red and $n_2$ black elements ($n_1 + n_2 \le n$). A random sample of
$r$ elements is selected. Let $X$ and $Y$ be the numbers of red and black elements in
it. Find the joint distribution, means, variances, and covariance. (Specialize
to spades and clubs in bridge.)





* 12. Let $X$ be the number of runs of red things in a random arrangement of $r_1$
red and $r_2$ black things. The probability distribution $\{\pi_k\}$ of $X$ is given in Chapter 3,
problem 15. Find $E(X)$, $\mathrm{Var}(X)$.

13. In the Polya urn scheme of Chapter 5, example (2.c), let $S_n$ be the total
number of black balls extracted in the first $n$ drawings. Find $E(S_n)$ and $\mathrm{Var}(S_n)$.
(Use the results of Chapter 5, problems 16-17, and verify by means of the
recursion formula, Chapter 5, problem 19.)

14. In the collector's problem (3.d) let $Y_r$ be the number of drawings required to
include $r$ preassigned elements (instead of any $r$ different elements as in the text).
Find $E(Y_r)$ and $\mathrm{Var}(Y_r)$.


15. Find $E\left( \frac{K}{N+1} \right)$ and $\mathrm{Cov}(K, N)$ in the example (1.e).
[In industrial practice the discovered defective item is replaced by an acceptable one so that $K/(N + 1)$
is the fraction of defectives and measures the quality of the lot. Note that
$E\left( \frac{K}{N+1} \right)$ is not $E(K)/E(N + 1)$.]


16. Stratified sampling. A city has $n$ blocks of which $n_j$ have $x_j$ inhabitants
($n_1 + n_2 + \cdots = n$). Let $m = \sum n_j x_j / n$ be the mean number of inhabitants per
block and put $\sigma^2 = \sum n_j x_j^2 / n - m^2$. In sampling without replacement $r$ blocks are
selected at random, and in each the inhabitants are counted. Let $X_1, \ldots, X_r$ be
the respective number of inhabitants. Show that

$E(X_1 + \cdots + X_r) = mr$,

$\mathrm{Var}(X_1 + \cdots + X_r) = \frac{\sigma^2 r (n - r)}{n - 1}$.

(Note that with sampling with replacement the variance would be larger, namely,
$\sigma^2 r$.)

17. Continuation.⁸ The number $x$ of inhabitants of a city is estimated by the
following double sampling procedure. The city is divided into $n$ strata. The
(known) number of blocks in stratum $j$ is $n_j$, so that $n = \sum n_j$ is the total number of
blocks in the city. The (unknown) number of inhabitants of the $k$th block in the
$j$th stratum is $x_{jk}$ (so that $x_j = \sum_k x_{jk}$ is the number of inhabitants of the stratum $j$,
and $x = \sum_j x_j$ is the number of inhabitants of the city). Out of the stratum $j$
a random sample of $r_j$ blocks is drawn, and the number of people in each is counted.
Let $X_{jk}$ be the number of people in the $k$th block of the sample taken from the
$j$th stratum. Then $X_j = \sum_k X_{jk}$ is the total number of inhabitants in the blocks
sampled in stratum number $j$. Put $X = \sum_j \frac{n_j}{r_j} X_j$. Show (using the preceding
result) that $E(X) = x$ and $\mathrm{Var}(X) = \sum_j \sigma_j^2 \frac{n_j^2 (n_j - r_j)}{r_j (n_j - 1)}$, where

8 Stratified sampling is usual in many applications, since greater accuracy can
be achieved at smaller costs. In population sampling the strata are often further
subdivided. There exists an elaborate theory of sampling under various conditions.
Cf. A Chapter in Population Sampling, by the Sampling Staff, Bureau of
the Census, U. S. Government, Washington, 1947.





18.$^{9}$ A large number, N, of people are subject to a blood test. This can be
administered in two ways. (i) Each person can be tested separately. In this case
N tests are required. (ii) The blood samples of k people can be pooled and analyzed
together. If the test is negative, this one test suffices for the k people. If the test
is positive, each of the k persons must be tested separately, and in all k + 1 tests
are required for the k people.

Assume the probability p that the test be positive is the same for all people and
that people are statistically independent.

(a) What is the probability that the test for a pooled sample of k people will be
positive?

(b) What is the expected value of the number, X, of tests necessary under plan
(ii)?

(c) What should k be to minimize the expected number of tests under plan (ii)?
Do not try numerical evaluations, since the problem leads to a rather cumbersome
equation for k.

19. Let $S_n$ be the number of successes in n Bernoulli trials. Prove

$E(|S_n - np|) = 2\nu q\, b(\nu; n, p)$

where $\nu$ is the integer such that $np < \nu \le np + 1$.

20. Let $S_n$ be the number of successes in n independent trials with probabilities
$p_1, p_2, \ldots, p_n$ of success. Let $a_n = (p_1 + p_2 + \cdots + p_n)/n$ be the average
probability of success. Show that for given $a_n$ the maximum of $\mathrm{Var}(S_n)$ is attained when
all $p_k$ are equal.$^{10}$

21. If two random variables X and Y assume only two values each, and if
$\mathrm{Cov}(X, Y) = 0$, then X and Y are independent.

22. Generalized Chebyshev inequality. Let $\phi(x)$ be monotonically increasing,
positive, and even, and suppose that $E(\phi(X)) = M$ exists. Prove

$\Pr\{|X| \ge t\} \le \frac{M}{\phi(t)}.$


23. Let $\{X_k\}$ be a sequence of mutually independent random variables with a
common distribution. We assume that the $X_k$ assume only positive values and
that $E(X_k) = a$ and $E(X_k^{-1}) = b$ exist. Let $S_n = X_1 + \cdots + X_n$. Prove that
$E(S_n^{-1})$ is finite and that $E(X_k/S_n) = 1/n$ for $k = 1, 2, \ldots, n$.

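24. Continuation.$^{11}$ Prove that

$E\left(\frac{S_m}{S_n}\right) = \frac{m}{n}$, if $m \le n$

$E\left(\frac{S_m}{S_n}\right) = 1 + (m - n)\, a\, E(S_n^{-1})$, if $m \ge n$.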


9 This problem is based on a new technique developed during World War II.
See R. Dorfman, The Detection of Defective Members of Large Populations, Annals
of Mathematical Statistics, vol. 14 (1943), pp. 436-440. In Army practice plan (ii)
introduced up to 80 per cent savings.

10 The variability of $p_k$ may be interpreted as disorder and in this sense disorder
decreases chance fluctuations. (For example, the number of annual fires in a community
may be treated as a random variable; for a given average number $a_n$ the
variability is maximal if all households have the same probability of fire.) Trials
of the type described in problem 20 are called Poisson trials [cf. Chapter 11, example
(8.6)].

11 The observation that 24 can be proved by introducing 23 is due to K. L. Chung. 





25. Let $X_1, \ldots, X_n$ be mutually independent random variables. Let U be a
function of $X_1, \ldots, X_k$ and V a function of $X_{k+1}, \ldots, X_n$ (k < n). Prove that
U and V are mutually independent random variables.

Hint: Consider the two sample spaces defined, respectively, by the sets
$(X_1, \ldots, X_k)$ and $(X_{k+1}, \ldots, X_n)$.

26. A sequence of Bernoulli trials is continued as long as necessary to obtain r
successes, where r is a fixed integer. Let X be the number of trials required. Find$^{12}$
$E(r/X)$. (The definition leads to infinite series for which a finite expression can be
obtained. The distribution of X is called after Pascal; cf. Chapter 11, section 3.)

27. Length of random chains.$^{13}$ A chain in the x, y-plane consists of n links,
each of unit length. The angle between two consecutive links is $\pm\alpha$ where $\alpha$
is a positive constant; each possibility has probability 1/2, and the successive
angles are mutually independent. The length $L_n$ of the chain is a random variable,
and we wish to prove that

(9.1) $E(L_n^2) = n\, \frac{1 + \cos\alpha}{1 - \cos\alpha} - 2\cos\alpha\, \frac{1 - \cos^n\alpha}{(1 - \cos\alpha)^2}.$

Without loss of generality the first link may be assumed to lie in the direction of
the positive x-axis. The angle between the kth link and the positive x-axis is a
random variable $S_{k-1}$ where $S_0 = 0$, $S_k = S_{k-1} + X_k\alpha$ and the $X_k$ are mutually
independent variables, assuming the values ±1 with probability 1/2. The projections
on the two axes of the kth link are $\cos S_{k-1}$ and $\sin S_{k-1}$. Hence for $n \ge 1$

(9.2) $L_n^2 = \left(\sum_{k=0}^{n-1} \cos S_k\right)^2 + \left(\sum_{k=0}^{n-1} \sin S_k\right)^2.$

Prove by induction successively for m < n

(9.3) $E(\cos S_n) = \cos^n\alpha, \quad E(\sin S_n) = 0;$

(9.4) $E((\cos S_m) \cdot (\cos S_n)) = \cos^{n-m}\alpha \cdot E(\cos^2 S_m)$

(9.5) $E((\sin S_m) \cdot (\sin S_n)) = \cos^{n-m}\alpha \cdot E(\sin^2 S_m)$

(9.6) $E(L_n^2) - E(L_{n-1}^2) = 1 + 2\cos\alpha \cdot \frac{1 - \cos^{n-1}\alpha}{1 - \cos\alpha}$

(with $L_0 = 0$) and hence finally (9.1).
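As a numerical sanity check (not a proof), the closed form (9.1) can be compared with a brute-force average over all $2^{n-1}$ equally likely sign sequences. The function names below are illustrative and do not come from the text; the sketch assumes the first link points along the positive x-axis, as in the problem.

```python
import math
from itertools import product

def expected_sq_length_exact(n, alpha):
    """Average of L_n^2 over all 2^(n-1) equally likely sign sequences."""
    total = 0.0
    count = 0
    for signs in product((1, -1), repeat=n - 1):
        angle = 0.0            # S_0 = 0: first link along the positive x-axis
        x, y = 1.0, 0.0        # end point after the first link
        for s in signs:
            angle += s * alpha # S_k = S_{k-1} + X_k * alpha
            x += math.cos(angle)
            y += math.sin(angle)
        total += x * x + y * y
        count += 1
    return total / count

def expected_sq_length_formula(n, alpha):
    """Right side of (9.1)."""
    c = math.cos(alpha)
    return n * (1 + c) / (1 - c) - 2 * c * (1 - c ** n) / (1 - c) ** 2

for n in (1, 2, 5, 10):
    print(n, expected_sq_length_exact(n, 0.7), expected_sq_length_formula(n, 0.7))
```

The enumeration grows like $2^{n-1}$, so it is only practical for short chains; its purpose is to confirm the signs and exponents in (9.1).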

12 This example illustrates the effect of optional stopping. If the number n of
trials is fixed, the ratio of the number N of successes to the number n of trials is
a random variable whose expectation is p. It is often erroneously assumed that
the same is true in our example where the number r of successes is fixed and the
number of trials depends on chance. If p = 1/2 and r = 2, then $E(2/X) = 0.614$
instead of 0.5; for r = 3 we find $E(3/X) = 0.579$.

13 This is the two-dimensional analogue to the problem of length of long polymer
molecules in chemistry. The problem is to illustrate applications to random variables
which are not expressible as sums of simple variables.
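The two numerical values quoted in footnote 12 can be checked by summing the defining series directly. This is a minimal sketch: the helper name and the cutoff `terms` are assumptions made for illustration, and the series is simply cut off once the remaining terms are negligible for p = 1/2.

```python
import math

def expected_ratio(r, p, terms=500):
    """Approximate E(r/X), where X is the number of Bernoulli trials
    needed to obtain r successes (Pascal distribution)."""
    q = 1.0 - p
    total = 0.0
    for k in range(r, r + terms):
        # Pr{X = k}: the r-th success occurs exactly at trial k
        prob = math.comb(k - 1, r - 1) * p ** r * q ** (k - r)
        total += (r / k) * prob
    return total

print(expected_ratio(2, 0.5))   # about 0.614, not p = 0.5
print(expected_ratio(3, 0.5))   # about 0.579
```

For r = 2 and p = 1/2 the series can also be summed in closed form as $2(1 - \log 2)$, which agrees with the value in the footnote.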



CHAPTER 10 


LAWS OF LARGE NUMBERS 

1. Identically Distributed Variables 

The limit theorems for Bernoulli trials derived in Chapters 7 and 8 
are special cases of general limit theorems which cannot be treated in 
this volume. However, we shall here discuss at least some cases of 
the law of large numbers in order to reveal a new aspect of the notion 
of the expectation of a random variable. 

The connection between Bernoulli trials and the theory of random
variables becomes clearer when we consider the dependence of the
number $S_n$ of successes on the number n of trials. With each trial $S_n$
increases by 1 or 0. To translate this statement into a formula, we
write

(1.1) $S_n = X_1 + \cdots + X_n,$

where the random variable $X_k$ equals 1 or 0 according to whether the
kth trial results in success or failure. Thus $S_n$ is a sum of n mutually
independent random variables, each of which assumes the values 1 and 0
with probabilities p and q. From this it is only one step to consider
sums of the form (1.1) where the $X_k$ are mutually independent variables
with an arbitrary distribution. The (weak) law of large numbers of
Chapter 7, section 3, states that for large n the average proportion of
successes $S_n/n$ is likely to lie near p. This is a special case of the following

Law of Large Numbers. Let $\{X_k\}$ be a sequence of mutually independent
random variables with a common distribution. If the expectation
$\mu = E(X_k)$ exists, then for every $\epsilon > 0$ as $n \to \infty$

(1.2) $\Pr\left\{\left|\frac{X_1 + \cdots + X_n}{n} - \mu\right| > \epsilon\right\} \to 0;$

in words, the probability that the average $S_n/n$ will differ from the
expectation by less than an arbitrarily prescribed $\epsilon$ tends to one.

In this generality the theorem was first proved by Khintchine.$^{1}$

1 A. Khintchine, Sur la loi des grands nombres, Comptes rendus de l'Académie des
Sciences, vol. 189 (1929), pp. 477-479.



Older proofs had to introduce the unnecessary restriction that also the
variance $\mathrm{Var}(X_k)$ be finite.$^{2}$ For this case, however, there exists a
much more precise result which generalizes the DeMoivre-Laplace
limit theorem for Bernoulli trials.

Central Limit Theorem. Let $\{X_k\}$ be a sequence of mutually independent
random variables with a common distribution. Suppose that
$\mu = E(X_k)$ and $\sigma^2 = \mathrm{Var}(X_k)$ exist and let $S_n = X_1 + \cdots + X_n$.
Then for every fixed $\alpha$, $\beta$ ($\alpha < \beta$)

(1.3) $\Pr\left\{\alpha < \frac{S_n - n\mu}{\sigma n^{1/2}} < \beta\right\} \to \Phi(\beta) - \Phi(\alpha);$

here $\Phi(x)$ is the normal distribution introduced in Chapter 7, section 1.
This theorem is due to Lindeberg;$^{3}$ Ljapunov and other authors had
previously proved it under more restrictive conditions. It must be
understood that the above theorem is only a very special case of a
much more general theorem which is closely connected with many
other limit theorems. We shall defer its general formulation and proof
to the second volume. Here we note that (1.3) is much stronger than
(1.2), since it gives an estimate for the probability that the discrepancy
$\left|\frac{1}{n} S_n - \mu\right|$ be larger than $\sigma/n^{1/2}$. On the other hand, the law of large
numbers (1.2) holds even when the random variables $X_k$ have no finite
variance so that it is more general than the central limit theorem. For
this reason we shall give an independent proof of the law of large
numbers, but first we illustrate the two limit theorems.

Examples. (a) Consider a sequence of independent throws of a
symmetric die and let $X_k$ be the number scored at the kth throw. Then
$E(X_k) = (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5$, and $\mathrm{Var}(X_k) = (1^2 + 2^2
+ 3^2 + 4^2 + 5^2 + 6^2)/6 - (3.5)^2 = 35/12$, and $S_n/n$ is the average
score in n throws. The law of large numbers states that for large n this
average is likely to be near 3.5. The central limit theorem states that
the probability of $|S_n - 3.5n| < \alpha (35n/12)^{1/2}$ is about $\Phi(\alpha) - \Phi(-\alpha)$.
For n = 1000 and $\alpha = 1$ we find that there is roughly
probability 0.68 that $3450 < S_n < 3550$. Choosing for $\alpha$ the median
value $\alpha = 0.6744$, we find that there are roughly equal chances that
$S_n$ lies within or without the interval $3500 \pm 36$.
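The figures in example (a) can be reproduced, and compared with the exact distribution of $S_n$, by the following sketch. It is illustrative only: the function names are assumptions, the exact distribution is obtained by repeated convolution with running sums, and floating-point arithmetic is used throughout.

```python
import math
from itertools import accumulate

def phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def dice_sum_pmf(n):
    """Exact distribution of S_n for n fair dice; pmf[s] = Pr{S_n = s}."""
    pmf = [1.0]                                   # S_0 = 0 with certainty
    for _ in range(n):
        cum = list(accumulate(pmf))               # cum[i] = pmf[0] + ... + pmf[i]
        ext = [0.0] * 6 + cum + [cum[-1]] * 5     # padded running sums
        # Pr{S = s} = (1/6) * (sum of the old pmf over s-6, ..., s-1)
        pmf = [0.0] + [(a - b) / 6.0 for a, b in zip(ext[6:], ext)]
    return pmf

def prob_within(n, alpha):
    """Exact Pr{|S_n - 3.5 n| < alpha * (35 n / 12)^(1/2)}."""
    half_width = alpha * math.sqrt(35.0 * n / 12.0)
    return sum(p for s, p in enumerate(dice_sum_pmf(n))
               if abs(s - 3.5 * n) < half_width)

n = 1000
print(math.sqrt(35.0 * n / 12.0))            # standard deviation of S_n, about 54
print(phi(1.0) - phi(-1.0))                  # normal value for alpha = 1, about 0.68
print(prob_within(n, 1.0))                   # exact probability for comparison
print(0.6744 * math.sqrt(35.0 * n / 12.0))   # half-width for even chances, about 36
```

Since the standard deviation is about 54, the interval $3450 < S_n < 3550$ quoted in the text is a rounded version of $3500 \pm 54$.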

(b) Sampling. Suppose that in a population of N families there are
$N_k$ families with exactly k children ($k = 0, 1, \ldots$; $\sum N_k = N$). If a

2 A. Markov showed that the existence of $E(|X_k|^{1+a})$ for some a > 0 suffices.
3 J. W. Lindeberg, Eine neue Herleitung des Exponentialgesetzes in der
Wahrscheinlichkeitsrechnung, Mathematische Zeitschrift, vol. 15 (1922), pp. 211-225.




family is chosen at random, the number of children in it is a random
variable which assumes the value $\nu$ with probability $p_\nu = N_\nu/N$. In
sampling with replacement a sample of size n contains n independent
random variables or “observations” $X_1, \ldots, X_n$, each with the same
distribution; $S_n/n$ is the sample average. The law of large numbers
tells us that for sufficiently large random samples the sample average
is likely to be near $\mu = \sum \nu p_\nu = \sum \nu N_\nu / N$, which is the population
average. The central limit theorem permits us to estimate the probable
magnitude of the discrepancy and to determine the sample size necessary
for reliable estimates. In practice both $\mu$ and $\sigma^2$ are unknown.
However, it is usually easy to obtain a preliminary estimate of $\sigma^2$, and
it is always possible to keep to the safe side. If it is desired that
there be probability 0.99 or better that the sample average $S_n/n$
differs from the unknown population mean $\mu$ by less than 1/10, then
the sample size should be such that

(1.4) $\Pr\left\{\left|\frac{S_n}{n} - \mu\right| < \frac{1}{10}\right\} \ge 0.99.$

Now the root of $\Phi(x) - \Phi(-x) = 0.99$ is x = 2.57..., and hence n
should be such that $n^{1/2}/10\sigma \ge 2.57$ or $n \ge 660\sigma^2$. A cautious preliminary
estimate of $\sigma^2$ gives us an idea of the required sample size.
Similar situations occur frequently. Thus when the experimenter
takes the mean of n measurements he, too, relies on the law of
large numbers and uses a sample mean as an estimate for an unknown
theoretical expectation. The reliability of this estimate can be judged
only in terms of $\sigma^2$, and usually we are compelled to use rather crude
estimates for $\sigma^2$.
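The sample-size rule derived from (1.4) can be written as a small calculation. The sketch below is an illustration under the normal approximation; the function name and its parameters are assumptions introduced here, and `sigma2` stands for whatever preliminary estimate of $\sigma^2$ is available.

```python
import math
from statistics import NormalDist

def required_sample_size(sigma2, tolerance, confidence):
    """Smallest n with n^(1/2) * tolerance / sigma >= x, where x is the
    root of Phi(x) - Phi(-x) = confidence (normal approximation)."""
    x = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return x, math.ceil((x / tolerance) ** 2 * sigma2)

x, n = required_sample_size(sigma2=1.0, tolerance=0.1, confidence=0.99)
print(x)   # about 2.576; the text rounds this to 2.57
print(n)   # a little above 660, since the root is not rounded down here
```

With the unrounded root the constant comes out slightly larger than the 660 of the text, which illustrates that such figures are only meant as rough guides.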

(c) The Poisson Distribution. In Chapter 7, section 4, we found
that for large $\lambda$ the Poisson distribution $\{p(k; \lambda)\}$ can be approximated
by the normal distribution. This is really a direct consequence of the
central limit theorem. Suppose that the variables $X_k$ have a Poisson
distribution $\{p(k; \gamma)\}$. Then $S_n$ has a Poisson distribution $\{p(k; n\gamma)\}$
with mean and variance equal to $n\gamma$. Writing $\lambda$ for $n\gamma$, we conclude
that for every $\alpha < \beta$ as $\lambda \to \infty$

(1.5) $\sum \frac{e^{-\lambda} \lambda^k}{k!} \to \Phi(\beta) - \Phi(\alpha);$

the summation extends over all k in the interval $(\lambda + \alpha\lambda^{1/2}, \lambda + \beta\lambda^{1/2})$.
This theorem is used in the theory of summability of divergent series
and is of general interest; estimates of the difference of the two sides
in (1.5) are available from the general theory.




As an application of (1.5) suppose that $\lambda$ is an integer and
let us compute

(1.6) $u_r = \sum_{-r \le k - \lambda \le r} \frac{e^{-\lambda} \lambda^k}{k!}.$

To make the limits of summation in (1.5) and (1.6) coincide we may
take for $-\alpha$ and $\beta$ any number between $r\lambda^{-1/2}$ and $(r + 1)\lambda^{-1/2}$. As
$\lambda$ increases, this interval of uncertainty decreases, and in the limit the
freedom of choice of $\alpha$ and $\beta$ disappears. However, for moderate
values of $\lambda$ it is most natural to choose for $-\alpha$ and $\beta$ the midpoint
$(r + \frac{1}{2})\lambda^{-1/2}$. We, therefore, compare $u_r$ with the normal approximation

(1.7) $u_r' = \Phi((r + \tfrac{1}{2})\lambda^{-1/2}) - \Phi(-(r + \tfrac{1}{2})\lambda^{-1/2}).$

Table 1 shows that the degree of approximation is highly satisfactory
(further numerical comparisons are found in Chapter 7, section 4).


TABLE 1

Showing the Normal Approximation (1.7) to the Poisson Expression (1.6)

   λ      r      u_r         u_r'        Difference

   9      3      0.760 08    0.756 66    0.0034
  16      4      0.741 17    0.739 41    0.0018
  25      5      0.729 73    0.728 67    0.0011
  49      7      0.716 53    0.716 01    0.0005
 100     10      0.706 52    0.706 28    0.0002
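The entries of Table 1 can be recomputed directly from (1.6) and (1.7). The following sketch uses illustrative function names and evaluates the Poisson terms through logarithms to avoid overflow for large $\lambda$.

```python
import math

def phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def poisson_pmf(k, lam):
    """e^(-lam) lam^k / k!, computed via logarithms."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def u_exact(lam, r):
    """Poisson expression (1.6) for integer lam."""
    return sum(poisson_pmf(k, lam) for k in range(lam - r, lam + r + 1))

def u_normal(lam, r):
    """Normal approximation (1.7) with the midpoint (r + 1/2) lam^(-1/2)."""
    x = (r + 0.5) / math.sqrt(lam)
    return phi(x) - phi(-x)

for lam, r in ((9, 3), (16, 4), (25, 5), (49, 7), (100, 10)):
    print(lam, r, round(u_exact(lam, r), 5), round(u_normal(lam, r), 5))
```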


Both the law of large numbers and the central limit theorem become
meaningless if the expectation $\mu$ does not exist, but it is possible to
replace them by more general limit theorems. It will be shown later
on that most recurrence times connected with physical processes are
random variables without finite expectation. Even in the simple coin-tossing
game the number of tosses up to the first equalization of the
accumulated numbers of heads and tails is a random variable to which
the law of large numbers does not apply. The corresponding limit
theorems (in particular the arc sine law) will be discussed in detail in
Chapter 12, section 5, where it will be found that the fluctuations of
such variables have many surprising features and that they are entirely
different from the fluctuations described in this chapter.




* 2. Proof of the Law of Large Numbers 

We proceed in two steps. First we assume that $\sigma^2 = \mathrm{Var}(X_k)$
exists, and note that in this case $\mathrm{Var}(S_n) = n\sigma^2$, by the addition rule
[formula (5.6) of Chapter 9]. According to the Chebyshev inequality
(Chapter 9, section 6), we have for every t > 0

(2.1) $\Pr\{|S_n - n\mu| > t\} \le \frac{n\sigma^2}{t^2}.$

For $t = \epsilon n$ we find that the left side is less than $\sigma^2/\epsilon^2 n$, which quantity
tends to zero. This accomplishes the proof.
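To see how conservative the Chebyshev estimate (2.1) is, one can compare it with an exact probability in a case where the latter is computable. The sketch below takes fair coin tossing ($\mu = 1/2$, $\sigma^2 = 1/4$) as an assumed illustration; the function names are not from the text.

```python
import math
from fractions import Fraction

def chebyshev_bound(var, n, eps):
    """Right side of (2.1) with t = eps * n, that is, var / (eps^2 n)."""
    return var / (eps ** 2 * n)

def exact_deviation_prob(n, eps):
    """Exact Pr{|S_n/n - 1/2| > eps} for n tosses of a fair coin."""
    favorable = sum(
        math.comb(n, k)
        for k in range(n + 1)
        if abs(Fraction(k, n) - Fraction(1, 2)) > eps
    )
    return float(Fraction(favorable, 2 ** n))

n, eps = 1000, Fraction(1, 20)
print(float(chebyshev_bound(Fraction(1, 4), n, eps)))  # bound from (2.1)
print(exact_deviation_prob(n, eps))                    # much smaller exact value
```

The bound suffices for the proof because it tends to zero, even though it greatly overestimates the actual probability.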

We now drop the restrictive condition that $\mathrm{Var}(X_k)$ exists. This
case is reduced to the preceding one by the method of truncation which
is an important standard tool used (with various refinements) in many
similar cases. We define two new collections of random variables
depending on the $X_k$. Let, for $k = 1, 2, \ldots, n$, and fixed $\epsilon > 0$,

(2.2) $U_k = X_k, \quad V_k = 0$ if $|X_k| \le \epsilon n$;
      $U_k = 0, \quad V_k = X_k$ if $|X_k| > \epsilon n$.

Then

(2.3) $X_k = U_k + V_k$

for all k.

Let $\{f(x_j)\}$ be the common probability distribution of the variables
$X_k$. We have assumed that $\mu = E(X_k)$ exists, which means that

(2.4) $\sum |x_j| f(x_j) = A$

is finite. Then

(2.5) $\mu' = E(U_k) = \sum_{|x_j| \le \epsilon n} x_j f(x_j),$

the summation extending over those j for which $|x_j| \le \epsilon n$. Note
that $\mu'$ depends on n but is common to $U_1, U_2, \ldots, U_n$. Moreover,
$\mu' \to \mu$ as $n \to \infty$, and hence for all n sufficiently large and arbitrary
$\delta > 0$

(2.6) $|\mu' - \mu| < \delta.$

Furthermore, from (2.5) and (2.4),

(2.7) $\mathrm{Var}(U_k) \le E(U_k^2) \le \epsilon n \sum_{|x_j| \le \epsilon n} |x_j| f(x_j) \le \epsilon A n.$

* Starred sections treat special topics and may be omitted at first reading. 





The $U_k$ are mutually independent, and their sum $U_1 + U_2 + \cdots + U_n$
can be treated exactly as the $X_k$ in the case of finite variances; applying
the Chebyshev inequality, we get the following analogue to (2.1)

(2.8) $\Pr\left\{\left|\frac{U_1 + \cdots + U_n}{n} - \mu'\right| > \delta\right\} \le \frac{\mathrm{Var}(U_k)}{n\delta^2} \le \frac{\epsilon A}{\delta^2}.$

In view of (2.6) this implies

(2.9) $\Pr\left\{\left|\frac{U_1 + \cdots + U_n}{n} - \mu\right| > 2\delta\right\} \le \frac{\epsilon A}{\delta^2}.$

Next we note that there is a large probability that $V_k = 0$. In fact

(2.10) $\Pr\{V_k \ne 0\} = \sum_{|x_j| > \epsilon n} f(x_j) \le \frac{1}{\epsilon n} \sum_{|x_j| > \epsilon n} |x_j| f(x_j).$

Since the series (2.4) converges, the last sum tends to 0 with increasing
n. Therefore for n sufficiently large

(2.11) $\Pr\{V_k \ne 0\} \le \frac{\delta}{n}$

and hence [cf. Chapter 1, formula (6.6)],

(2.12) $\Pr\{V_1 + \cdots + V_n \ne 0\} \le \delta.$

Now $S_n = (U_1 + \cdots + U_n) + (V_1 + \cdots + V_n)$, and therefore from
(2.9) and (2.12)

$\Pr\left\{\left|\frac{S_n}{n} - \mu\right| > 2\delta\right\} \le \Pr\left\{\left|\frac{U_1 + \cdots + U_n}{n} - \mu\right| > 2\delta\right\} + \Pr\{V_1 + \cdots + V_n \ne 0\} \le \frac{\epsilon A}{\delta^2} + \delta.$

Since $\epsilon$ and $\delta$ are arbitrary, the right side can be made arbitrarily small,
and this proves the assertion.
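The quantities appearing in the truncation argument can be tabulated for a concrete heavy-tailed distribution. The sketch below is purely illustrative: the distribution ($x_j = j$ with $f(x_j)$ proportional to $j^{-3}$) is an assumption chosen because its mean is finite while its second moment grows without bound, and for computability it is cut off at `terms` points, so it only mimics the infinite-variance case.

```python
def truncation_summary(n, eps, terms=100_000):
    """Quantities from (2.4)-(2.7) and (2.10) for a hypothetical distribution
    x_j = j, f(x_j) proportional to j^-3, cut off at `terms` points."""
    norm = sum(j ** -3.0 for j in range(1, terms + 1))
    cutoff = eps * n                      # truncation level of (2.2)
    A = mu_trunc = second = tail = 0.0
    for j in range(1, terms + 1):
        f = j ** -3.0 / norm
        A += j * f                        # (2.4); equals mu since all x_j > 0
        if j <= cutoff:
            mu_trunc += j * f             # (2.5): mu' = E(U_k)
            second += j * j * f           # E(U_k^2)
        else:
            tail += f                     # Pr{V_k != 0}
    return {
        "A": A,
        "mu_trunc": mu_trunc,
        "second_moment": second,
        "bound_2_7": eps * A * n,         # right side of (2.7)
        "n_tail": n * tail,               # bound for Pr{V_1 + ... + V_n != 0}
    }

for n in (1_000, 10_000):
    print(n, truncation_summary(n, 0.1))
```

As n grows, the truncated mean approaches the full mean and n times the tail probability shrinks, which is exactly what (2.6) and (2.12) require.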


3. The Theory of “Fair” Games 

For a further analysis of the implications of the law of large numbers
we shall use the time-honored terminology of gamblers, but our discussion
bears equally on less frivolous applications, and our two basic
assumptions are more realistic in statistics and physics than in gambling
halls. First, we shall assume that our gambler possesses an unlimited
capital so that no loss can force a termination of the game. (Dropping
this assumption leads to the problem of the gambler’s ruin, which
from the very beginning has intrigued students of probability. It is
of importance in Wald’s sequential analysis and in the theory of
stochastic processes, and will be taken up in Chapter 14.) Second, we
shall assume that the gambler does not have the privilege of optional
stopping; the number n of trials must be fixed in advance independently
of the development of the game. In practice a player blessed with an
unlimited capital would wait for a run of good luck and quit at an
opportune moment. Such a player is not interested in the probable
fluctuation at a prescribed moment, but only in the maximal fluctuations
in the long run. Light is shed on this problem by the law of the
iterated logarithm rather than by the law of large numbers (cf. Chapter
8, section 5).

The random variable $X_k$ will now be interpreted as the (positive
or negative) gain at the kth trial of a player who keeps playing the same
type of game of chance. The sum $S_n = X_1 + \cdots + X_n$ is the accumulated
gain in n independent trials. If the player pays for each trial
an entrance fee $\mu'$ (not necessarily positive), then $n\mu'$ represents the
accumulated entrance fees, and $S_n - n\mu'$ the accumulated net gain.
The law of large numbers applies if $\mu = E(X_k)$ exists. It says roughly
that for sufficiently large n the difference $S_n - n\mu$ is likely to be small
in comparison to n. Therefore, if the entrance fee $\mu'$ is smaller than $\mu$,
then, for large n, the player is likely to have a positive gain of the
order of magnitude $n(\mu - \mu')$. For the same reason an entrance fee
$\mu' > \mu$ is practically sure to lead to a loss. In short, the case $\mu' < \mu$
is favorable to the player, while $\mu' > \mu$ is unfavorable.

Note that nothing is said about the case $\mu' = \mu$. The only possible
conclusion in this case is that, for n sufficiently large, the accumulated
gain or loss $S_n - n\mu$ will with overwhelming probability be small in
comparison with n. It is not stated whether $S_n - n\mu$ is likely to be
positive or negative, that is, whether the game is favorable or unfavorable.
This was overlooked in the classical theory which called
$\mu' = \mu$ a “fair” price and a game with $\mu' = \mu$ “fair.” Much harm was
done by the misleading suggestive power of this name. It must be
understood that a “fair” game may be distinctly favorable or unfavorable
to the player.

It is clear that “normally” not only $E(X_k)$ but also $\mathrm{Var}(X_k)$ exists.
In this case the law of large numbers is supplemented by the central
limit theorem, and the latter tells us that, with a “fair” game, the
long-run net gain $S_n - n\mu$ is likely to be of the order of magnitude $n^{1/2}$
and that for large n there are about equal odds for this net gain to be
positive or negative. Thus, when the central limit theorem applies,
the term “fair” appears justified, but even in this case we deal with a
limit theorem with emphasis on the words “long run.” A closer
analysis shows that the convergence in (1.3) deteriorates with increasing
variance. If $\sigma^2 = \mathrm{Var}(X_k)$ is large, then the normal approximation
may become effective only for exceedingly large n.

To fix ideas, consider a slot machine where the player has a probability
of $10^{-6}$ to win $10^6 - 1$ dollars, and the alternative of losing the
entrance fee $\mu' = 1$. Here we have Bernoulli trials, and the game is
“fair.” In a million trials the player pays as many dollars in entrance
fees. He may hit the jackpot 0, 1, 2, ... times. We know from the
Poisson approximation to the binomial distribution that, with an
accuracy to several decimal places, the probability of hitting the jackpot
exactly k times is $e^{-1}/k!$. Thus the player has probability 0.368...
to lose a million, and the same probability of barely recovering his
expenses; he has probability 0.184... to gain exactly one million, etc.
Here $10^6$ trials are equivalent to one single trial in a game with the
gain distributed according to a Poisson distribution (which could be
realized by matching two large decks of cards; cf. Chapter 4, section 4).
Now all fire, automobile, and similar insurance is of the described type;
the risk involves a huge sum, but the corresponding probability is
very small. Moreover, one plays ordinarily only one trial per year,
so that the number n of trials never grows large. For the insured the
game is necessarily “unfair,” but it may well be economically advantageous:
the law of large numbers is of no relevance to him. As for
the company, it plays a large number of games, but because of the
large variance the chance fluctuations are pronounced. The premiums
must be fixed so as to preclude a huge loss in any specific year, and
hence the company is concerned with the ruin problem rather than
the law of large numbers.
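The slot-machine figures can be checked by comparing the exact binomial probabilities with the Poisson values $e^{-1}/k!$. This is a small illustrative computation; the helper name is an assumption.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli trials."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

n, p = 10 ** 6, 1e-6
for k in range(4):
    # k jackpots of 10^6 - 1 dollars net, n - k lost entrance fees of 1 dollar
    net_gain = k * (10 ** 6 - 1) - (n - k)
    print(k, net_gain, binomial_pmf(k, n, p), math.exp(-1.0) / math.factorial(k))
```

The net gain after a million trials is $(k - 1) \cdot 10^6$ dollars, so k = 0 means losing a million, k = 1 means breaking even, and k = 2 means gaining a million, as stated in the text.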

When the variance is infinite, the term “fair games” becomes an
absolute misnomer; there is no reason to believe that the accumulated
net gain $S_n - n\mu'$ fluctuates around zero. In fact, there exist examples
of “fair” games$^{4}$ where the probability tends to one that the player
will have sustained a net loss. The law of large numbers asserts that
this net loss is likely to be of smaller order of magnitude than n. However,
nothing more can be asserted. If $a_n$ is an arbitrary sequence such
that $a_n/n \to 0$, it is possible to construct a “fair” game where the
probability tends to one that at the nth trial the accumulated net loss
exceeds $a_n$. Problem 12 contains an example where the player has a
practical assurance that his loss will exceed n/log n. This game is
“fair,” and the entrance fee is unity. It is difficult to imagine that a

4 W. Feller, Note on the Law of Large Numbers and “Fair” Games, Annals of 
Mathematical Statistics , vol. 16 (1945), pp. 301-304. 





player will find it “fair” if after 1,000,000 games he is practically 
certain to have lost more than 150,000 units, and the loss is likely to 
keep increasing. 

* 4. The Petersburg Game 

If the variables $X_k$ have no finite expectation, the law of large
numbers becomes inapplicable and must be replaced by another limit
theorem describing the asymptotic behavior of the sum $S_n$. In the
second volume we shall generalize both the law of large numbers and
the central limit theorem to random variables without expectations;
this condition is typical for recurrence times in many physical processes,
and the generalized limit theorems are therefore of more than theoretical
interest.$^{5}$ The classical theory had no rigorous mathematical
formulations at its disposal and had therefore conceptual difficulties
with random variables without expectation. The law of large numbers
was intimately connected with the concept of probability and was not
susceptible of a mathematical analysis. This led even quite recent
writers to conclusions which are difficult to understand from the point
of view of a formalized theory. The general theory of limit theorems
is deferred to the second volume, but it seems appropriate to describe
the modern approach, using the time-honored example of the so-called
Petersburg paradox.$^{6}$

A single trial in the Petersburg game consists in tossing a true
coin until it falls heads; if this occurs at the rth throw the player
receives $2^r$ dollars. In other words, the gain at each trial is a random
variable assuming the values $2^1, 2^2, 2^3, \ldots$ with corresponding probabilities
$2^{-1}, 2^{-2}, 2^{-3}, \ldots$. The expectation is formally defined by
$\sum x_r f(x_r)$ with $x_r = 2^r$ and $f(x_r) = 2^{-r}$, so that each term of the series
equals 1. Thus the gain has no finite expectation, and the law of large
numbers is inapplicable. Now the game becomes less favorable to the
player when amended by the rule that he receives nothing if a trial
takes more than N tosses (that is, if the coin falls tails N times in
succession). In this amended game the gain has the finite expectation
N, and the law of large numbers applies. Therefore, if the player
pays a constant entrance fee $\mu' > 0$ for each trial and plays n games,
then for n sufficiently large he is almost sure to have a net profit. This


5 An interesting special case, connected with the coin-tossing game, is discussed
in Chapter 12, section 5. It will give a fair idea about the nature of fluctuations
of sums of random variables without finite expectation.

6 This paradox was discussed by Daniel Bernoulli (1700-1782), who tried in vain
to solve it by the concept of moral expectation. Note that Bernoulli trials are
named after James Bernoulli.





is true for every $\mu'$, but the larger $\mu'$, the larger must n be in order that a
positive gain be probable. The classical theory concluded that $\mu' = \infty$
is a “fair” entrance fee, but the modern student will hardly understand
the mysterious discussions of this “paradox.”

It is perfectly possible to determine entrance fees with which the
Petersburg game will have all properties of a “fair” game in the classical
sense except that these entrance fees will depend on the number of
trials instead of remaining constant. Variable entrance fees are undesirable
in gambling halls, but there the Petersburg game would be
impossible anyway because of limited resources. In the case of a finite
expectation $\mu = E(X_k) > 0$, a game is called “fair” if for large n the
ratio of the accumulated gain $S_n$ to the accumulated entrance fees
$e_n = n\mu'$ is likely to be near 1 (that is, if the difference $S_n - e_n$ is
likely to be of smaller order of magnitude than $e_n = n\mu'$). If $E(X_k)$
does not exist, we cannot put $e_n = n\mu$ but must determine $e_n$ in another
way. We shall say that a game is “fair” in the classical sense if it is
possible to determine accumulated entrance fees $e_n$ so that for every $\epsilon > 0$


(4.1) $\Pr\left\{\left|\frac{S_n}{e_n} - 1\right| > \epsilon\right\} \to 0.$

This is the complete analogue of the law of large numbers where
$e_n = n\mu'$. The latter is interpreted by the physicist to the effect that
the average of n independent measurements is bound to be near $\mu$.
In the present instance the average of n measurements is bound to be
near $e_n/n$. Our limit theorem (4.1), when it applies, has a mathematical
and operational meaning which is not different from the law of large
numbers.

We shall now show$^{7}$ that the Petersburg game becomes “fair” in the
classical sense if we put $e_n = n\, \mathrm{Log}\, n$, where Log n is the logarithm to
the base 2, that is, $2^{\mathrm{Log}\, n} = n$.

Proof. We use again the method of truncation of section 2. Instead
of (2.2) we now define the variables $U_k$ and $V_k$ ($k = 1, 2, \ldots, n$) by

(4.2) $U_k = X_k, \quad V_k = 0$ if $X_k \le n\, \mathrm{Log}\, n$;
      $U_k = 0, \quad V_k = X_k$ if $X_k > n\, \mathrm{Log}\, n$.

Then again $X_k = U_k + V_k$, and the $U_k$ are mutually independent. For

7 This is a special case of a generalized law of large numbers from which necessary
and sufficient conditions for (4.1) can easily be derived; cf. W. Feller, Acta
Scientiarum Litterarum Univ. Szeged, vol. 8 (1937), pp. 191-201.




every t we have $\Pr\{X_k > t\} < 2/t$ and hence $\Pr\{V_k \ne 0\} < 2/(n\, \mathrm{Log}\, n)$,
or

(4.3) $\Pr\{V_1 + V_2 + \cdots + V_n \ne 0\} < \frac{2}{\mathrm{Log}\, n} \to 0.$

To verify (4.1) it suffices therefore to prove that

(4.4) $\Pr\{|U_1 + \cdots + U_n - n\, \mathrm{Log}\, n| > \epsilon n\, \mathrm{Log}\, n\} \to 0.$

Now put $\mu = E(U_k)$ and $\sigma^2 = \mathrm{Var}(U_k)$. These quantities depend on n,
but are common to $U_1, U_2, \ldots, U_n$. If r is the largest integer such that
$2^r \le n\, \mathrm{Log}\, n$, then $\mu = r$ and hence for sufficiently large n

(4.5) $\mathrm{Log}\, n < \mu \le \mathrm{Log}\, n + \mathrm{Log}\, \mathrm{Log}\, n.$

Similarly

(4.6) $\sigma^2 < E(U_k^2) = 2 + 2^2 + \cdots + 2^r < 2^{r+1} \le 2n\, \mathrm{Log}\, n.$

Since the sum $U_1 + \cdots + U_n$ has mean $n\mu$ and variance $n\sigma^2$, we
have by Chebyshev’s inequality

(4.7) $\Pr\{|U_1 + \cdots + U_n - n\mu| > \epsilon n \mu\} \le \frac{n\sigma^2}{\epsilon^2 n^2 \mu^2} < \frac{2}{\epsilon^2\, \mathrm{Log}\, n} \to 0.$

Now by (4.5) $\mu \sim \mathrm{Log}\, n$, and hence (4.7) is equivalent to (4.4) and
therefore to (4.1).
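The estimates (4.3) and (4.5) to (4.7) can be evaluated exactly for particular n, which shows how slowly the bounds tend to zero. The sketch below is illustrative; the function name and the returned dictionary keys are assumptions.

```python
import math

def petersburg_truncation(n, eps):
    """Exact truncated moments of the Petersburg gain at level n Log n,
    with the bounds used in (4.3) and (4.7). Log is the base-2 logarithm."""
    log_n = math.log2(n)
    level = n * log_n                      # truncation level of (4.2)
    r = math.floor(math.log2(level))       # largest r with 2^r <= n Log n
    while 2 ** (r + 1) <= level:           # guard against rounding of log2
        r += 1
    while 2 ** r > level:
        r -= 1
    mu = r                                 # E(U_k): r terms, each equal to 1
    second_moment = 2 ** (r + 1) - 2       # E(U_k^2) = 2 + 2^2 + ... + 2^r
    var = second_moment - mu ** 2
    return {
        "log_n": log_n,
        "mu": mu,
        "second_moment": second_moment,
        "n_tail": n * 2.0 ** (-r),         # n * Pr{X_k > n Log n}, cf. (4.3)
        "chebyshev": n * var / (eps * n * mu) ** 2,   # middle term of (4.7)
    }

for n in (2 ** 10, 2 ** 20, 2 ** 40):
    print(n, petersburg_truncation(n, eps=1.0))
```

Both bounds decrease only like 1/Log n, so the "fairness" expressed by (4.1) becomes visible only for very large numbers of trials.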

5. Variable Distributions 

Up to now we have considered only the case where the variables $X_k$
have the same distribution. In gambling this case arises when the
player keeps playing the same game of chance; however, it is even
more interesting to see what happens if the type of game changes at
each step. It is not necessary to think of gambling places; the statistician
who applies statistical tests is engaged in a dignified sort of
gambling, and in his case the distribution of the random variables
changes from occasion to occasion.

To fix ideas we shall imagine that an infinite sequence of probability
distributions is given. For every n we may then speak of mutually
independent variables $X_1, \ldots, X_n$ with the prescribed distributions.
We shall assume that the means and variances exist and put

(5.1) $\mu_k = E(X_k), \quad \sigma_k^2 = \mathrm{Var}(X_k).$

The sum $S_n = X_1 + \cdots + X_n$ has also finite mean and variance

(5.2) $m_n = E(S_n), \quad s_n^2 = \mathrm{Var}(S_n)$

which are given by

(5.3) $m_n = \mu_1 + \cdots + \mu_n, \quad s_n^2 = \sigma_1^2 + \cdots + \sigma_n^2$

[cf. Chapter 9, formulas (2.4) and (5.6)]. In the special case of identical
distribution we had $m_n = n\mu$, $s_n^2 = n\sigma^2$.

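The (weak) law of large numbers is said to hold for the sequence $\{X_k\}$
if for every $\epsilon > 0$

(5.4) $\Pr\left\{\frac{|S_n - m_n|}{n} > \epsilon\right\} \to 0.$

The sequence $\{X_k\}$ is said to obey the central limit theorem if for every
fixed $\alpha < \beta$

(5.5) $\Pr\left\{\alpha < \frac{S_n - m_n}{s_n} < \beta\right\} \to \Phi(\beta) - \Phi(\alpha).$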

It is one of the salient features of probability theory that both the
law of large numbers and the central limit theorem hold for a surprisingly
large class of sequences $\{X_k\}$. In particular, the law of large
numbers holds whenever the $X_k$ are uniformly bounded, that is, whenever
there exists a constant A such that

(5.6) $|X_k| < A$

for all k. More generally, a sufficient condition for the law of large
numbers to hold is that

(5.7) $\frac{s_n}{n} \to 0.$

This is a direct consequence of the Chebyshev inequality, and the
proof given in the opening passage of section 2 applies. Note, however,
that the condition (5.7) is not necessary (cf. problem 14).

Various sufficient conditions for the central limit theorem have been
discovered, but all were superseded by the Lindeberg$^8$ theorem according
to which the central limit theorem holds whenever for every $\epsilon > 0$ the
truncated variables $U_k$ defined by

(5.8)    $U_k = X_k$ if $|X_k| \le \epsilon s_n$,
         $U_k = 0$ if $|X_k| > \epsilon s_n$,

($k = 1, 2, \ldots, n$) satisfy the condition

(5.9)    $\mathrm{Var}(U_1 + \cdots + U_n) \sim s_n^2.$

8 J. W. Lindeberg, loc. cit. (footnote 3).




Here the sign ~ indicates that $s_n \to \infty$ and that the ratio of the
two sides tends to unity.

If the $X_k$ are uniformly bounded [that is, if (5.6) holds] then
$U_k = X_k$ for all $n$ which are so large that $s_n > A\epsilon^{-1}$. Therefore the
Lindeberg theorem implies that every uniformly bounded sequence $\{X_k\}$
of mutually independent random variables obeys the central limit theorem,
provided, of course, that $s_n \to \infty$. (The last condition is violated only
for degenerate sequences.) It was found that the Lindeberg condition is
also necessary for (5.5) to hold.$^9$ The proof is deferred to the second
volume, where we shall also give estimates for the difference between
the two sides in (5.5).

In the case where the variables X k have a common distribution we 
found the central limit theorem to be stronger than the law of large 
numbers. This is not so in general, and we shall see that the central 
limit theorem may apply to sequences which do not obey the law of 
large numbers. 

Examples. (a) Let $\lambda > 0$ be fixed, and let $X_k = \pm k^{\lambda}$, each with
probability 1/2 (e.g., a coin is tossed, and at the $k$th throw the stakes
are $\pm k^{\lambda}$). Here $\mu_k = 0$, $\sigma_k^2 = k^{2\lambda}$, and

(5.10)    $s_n^2 = 1^{2\lambda} + 2^{2\lambda} + 3^{2\lambda} + \cdots + n^{2\lambda} \sim \frac{n^{2\lambda+1}}{2\lambda+1}.$

The condition (5.7) is satisfied if $\lambda < 1/2$. Therefore the law of large
numbers holds if $\lambda < 1/2$; we shall presently see that it does not hold
if $\lambda \ge 1/2$.

For $k = 1, 2, \ldots, n$ we have $|X_k| = k^{\lambda} \le n^{\lambda}$, so that for
$n > (2\lambda + 1)\epsilon^{-2}$ the truncated variables $U_k$ are identical with the $X_k$.
Hence the Lindeberg condition applies for $\lambda > 0$, and

$\Pr\left\{ \alpha < \left(\frac{2\lambda+1}{n^{2\lambda+1}}\right)^{1/2} S_n < \beta \right\} \to \Phi(\beta) - \Phi(\alpha).$

It follows that $S_n$ is likely to be of the order of magnitude $n^{\lambda + 1/2}$, so
that the law of large numbers cannot apply for $\lambda \ge 1/2$. We see that
in this example the central limit theorem applies for all $\lambda > 0$, but the
law of large numbers only if $\lambda < 1/2$.
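The asymptotic relation (5.10) and the condition (5.7) of example (a) can be checked numerically. The following is only a sketch: the function names `s_n_squared` and `leading_term` are illustrative and not part of the text, and floating-point sums stand in for the exact values.

```python
def s_n_squared(n, lam):
    """Exact variance of S_n = X_1 + ... + X_n when X_k = +/- k**lam with probability 1/2 each."""
    return sum(k ** (2 * lam) for k in range(1, n + 1))


def leading_term(n, lam):
    """Leading term n^(2*lam + 1) / (2*lam + 1) on the right side of (5.10)."""
    return n ** (2 * lam + 1) / (2 * lam + 1)


if __name__ == "__main__":
    for lam in (0.25, 0.5, 1.0):
        for n in (10, 1000, 100000):
            s2 = s_n_squared(n, lam)
            # The first ratio should approach 1; the second is s_n / n from condition (5.7).
            print(lam, n, s2 / leading_term(n, lam), s2 ** 0.5 / n)
```

For $\lambda < 1/2$ the printed values of $s_n/n$ decrease toward zero, while for $\lambda \ge 1/2$ they do not, in agreement with the discussion above.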

( b) Consider two independent sequences of 1000 tossings of a coin 
(or emptying two bags of 1000 coins each). We want to investigate 

9 W. Feller, Über den zentralen Grenzwertsatz der Wahrscheinlichkeitsrechnung,
Mathematische Zeitschrift, vol. 40 (1935), pp. 521-559. There also a generalized
central limit theorem is derived which may apply to variables without
expectations. Note that we are here considering only independent variables; for
dependent variables the Lindeberg condition is neither necessary nor sufficient.




the difference $D$ of the number of heads. Let the tossings of the two
sequences be numbered from 1 to 1000 and from 1001 to 2000, respectively.
We define 2000 random variables $X_k$ as follows: if the $k$th
coin falls tails, then $X_k = 0$. If it falls heads, then we put $X_k = 1$ or
$X_k = -1$, according to whether $k \le 1000$ or $k > 1000$. Then
$D = X_1 + X_2 + \cdots + X_{2000}$. Moreover, $\mu_k = \pm 1/2$, depending on
the sequence to which the coin belongs, $\sigma_k^2 = 1/4$, $m_{2000} = 0$,
$s_{2000}^2 = 500$. Therefore the probability that the difference $D$ will lie within
the limits $\pm (500)^{1/2} a$ is $\Phi(a) - \Phi(-a)$, approximately. The random
variable $D$ is therefore comparable to the deviation $S_{2000} - 1000$ of
the number of heads in 2000 tossings from its expected number 1000.
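The normal approximation in example (b) can be compared with the exact probability. The sketch below relies on the fact that $D + 1000$ has exactly the binomial distribution with 2000 trials and $p = 1/2$ (the number of tails in the second sequence is again binomial); the function names are illustrative assumptions, not notation from the text.

```python
from math import comb, erf, sqrt


def exact_prob_abs_d_at_most(limit, tosses=1000):
    """Exact Pr{|D| <= limit} for the difference D of heads in two sequences of `tosses` fair tosses."""
    n = 2 * tosses
    m = int(limit)  # D is an integer, so |D| <= limit means |D| <= floor(limit)
    favorable = sum(comb(n, tosses + d) for d in range(-m, m + 1))
    return favorable / 2 ** n


def normal_prob(a):
    """Phi(a) - Phi(-a) for the standard normal distribution."""
    return erf(a / sqrt(2))


if __name__ == "__main__":
    for a in (0.5, 1.0, 2.0):
        print(a, exact_prob_abs_d_at_most(sqrt(500) * a), normal_prob(a))
```

The two printed columns should be close; the remaining gap is mostly due to the discreteness of $D$.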

(c) An application to the theory of inheritance will illustrate the great
variety of conclusions based on the central limit theorem. In Chapter 5
we have studied traits which depend essentially only on one pair of
genes (alleles). We conceive of other characters (like height) as the
cumulative effect of a great number of pairs of genes. For simplicity,
suppose that with respect to each particular pair of genes the individual
belongs to one of the three genotypes AA, Aa, or aa. Let $x_1$, $x_2$, and $x_3$
be the corresponding contributions to the height [if $x_2 > (x_1 + x_3)/2$,
the gene A is partially dominant]. The genotype of an individual is a
random event, and hence the contribution of our particular pair of
genes to the height is a random variable $X$, assuming the three values
$x_1, x_2, x_3$ with certain probabilities. The height is the cumulative
effect of many such random variables $X_1, X_2, \ldots, X_n$, and since the
contribution of each is small, we may in first approximation assume
that the height is the sum $X_1 + \cdots + X_n$. It is true that not all the
$X_k$ are mutually independent. However, the central limit theorem
holds also for large classes of dependent variables, and, besides, it is
plausible that the great majority of the $X_k$ can be treated as
independent. These considerations can be rendered more precise; here they
serve only as an indication of how the central limit theorem explains why
many biometric characters, like height, exhibit an empirical
distribution which is close to the normal distribution. This theory permits
also the prediction of properties of inheritance, e.g., the dependence of
the mean height of children on the height of their parents. Such
biometric investigations of F. Galton and Karl Pearson laid the
foundations of modern statistical theory.

* 6. Applications to Combinatorial Analysis 

We shall give two examples of applications of the central limit 
theorem to problems which are not directly connected with probability 
theory. 

* Starred sections treat special topics and may be omitted at first reading. 





In the following we consider the space whose points are the $n!$
permutations of the elements $a_1, a_2, \ldots, a_n$ and attribute to each
permutation probability $1/n!$.

(a) Inversions. The element $a_k$ is said to produce $r$ inversions if it
precedes $r$ elements among $a_1, a_2, \ldots, a_{k-1}$ (which precede it in the
natural order). Thus in the permutation $(a_3 a_6 a_1 a_5 a_2 a_4)$ the element $a_2$
produces no inversions, $a_3$ two, $a_4$ none, $a_5$ two, $a_6$ four. The total
number of inversions in this case is eight. In $(a_6 a_5 a_4 a_3 a_2 a_1)$ there are
fifteen inversions of which $a_k$ produces $k - 1$ ($k = 2, 3, 4, 5, 6$). Let
$X_k$ be the number of inversions produced by the element $a_k$. Then $X_k$
is a random variable and $S_n = X_1 + \cdots + X_n$ is the total number of
inversions. Now $X_k$ assumes the values $0, 1, \ldots, k - 1$, each with
probability $1/k$, and therefore

(6.1)    $\mu_k = \frac{k-1}{2}, \qquad \sigma_k^2 = \frac{1 + 2^2 + \cdots + (k-1)^2}{k} - \left(\frac{k-1}{2}\right)^2 = \frac{k^2 - 1}{12}.$

The number of inversions produced by $a_k$ does not depend on the
relative order of $a_1, a_2, \ldots, a_{k-1}$, and hence the $X_k$ are mutually
independent. From (6.1) we get

(6.2)    $m_n = \frac{1 + 2 + \cdots + (n-1)}{2} = \frac{n(n-1)}{4} \sim \frac{n^2}{4}$

and

(6.3)    $s_n^2 = \frac{1}{12} \sum_{k=1}^{n} (k^2 - 1) = \frac{2n^3 + 3n^2 - 5n}{72} \sim \frac{n^3}{36}.$

For large $n$ we have $\epsilon s_n > n > X_k$, and hence the variables $U_k$ of the
Lindeberg condition are identical with $X_k$. Therefore the central limit
theorem applies, and we conclude that the number $N_n$ of permutations
for which the number of inversions lies between the limits
$\frac{n^2}{4} \pm \frac{a}{6} n^{3/2}$ is, asymptotically, given by
$n!\{\Phi(a) - \Phi(-a)\}$. In particular, for about
one-half of all permutations the number of inversions lies between the
limits $(n^2/4) \pm (0.11) n^{3/2}$.
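Formulas (6.2) and (6.3) can be verified for small $n$ by enumerating all $n!$ permutations. This is a brute-force sketch with exact rational arithmetic; the helper names `inversions` and `inversion_moments` are illustrative, and a permutation $(a_3 a_6 a_1 \ldots)$ is represented by the tuple of its subscripts.

```python
from fractions import Fraction
from itertools import permutations


def inversions(perm):
    """Count pairs in which an element with larger subscript precedes one with smaller subscript."""
    n = len(perm)
    return sum(1 for i in range(n) for j in range(i + 1, n) if perm[i] > perm[j])


def inversion_moments(n):
    """Exact mean and variance of the number of inversions over all n! permutations."""
    counts = [inversions(p) for p in permutations(range(1, n + 1))]
    mean = Fraction(sum(counts), len(counts))
    second = Fraction(sum(c * c for c in counts), len(counts))
    return mean, second - mean ** 2


if __name__ == "__main__":
    print(inversions((3, 6, 1, 5, 2, 4)))  # the first permutation discussed in the text
    for n in range(2, 8):
        print(n, inversion_moments(n))
```

Enumeration is feasible only for small $n$; for large $n$ the normal approximation described above takes over.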

(b) Cycles. Every permutation can be broken down into cycles,
that is, groups of elements permuted among themselves. Thus in
$(a_3 a_6 a_1 a_5 a_2 a_4)$ we find that $a_1$ and $a_3$ are interchanged, and that the
remaining four elements are permuted among themselves; this
permutation contains two cycles. If an element is in its natural place, it
forms a cycle so that the identical permutation $(a_1, a_2, \ldots, a_n)$ contains
as many cycles as elements. On the other hand, the cyclical permutations
$(a_2, a_3, \ldots, a_n, a_1)$, $(a_3, a_4, \ldots, a_n, a_1, a_2)$, etc., contain a single
cycle each. For the study of cycles it is convenient to describe a
permutation by means of arrows indicating to which places the
elements have been moved. For example $1 \to 3 \to 4 \to 1$ indicates that
$a_1$ has been moved to the third place, $a_3$ to the fourth, and $a_4$ to the
first; this completes the first cycle, and we continue with the next
element in the natural order, namely, $a_2$. Thus $(a_4 a_8 a_1 a_3 a_2 a_5 a_7 a_6)$
would be described by $1 \to 3 \to 4 \to 1$; $2 \to 5 \to 6 \to 8 \to 2$; $7 \to 7$;
here we have three cycles of lengths 3, 4, 1, respectively.

To study the number of cycles we let the random variables
$X_k$ ($k = 1, 2, \ldots, n$) equal 1 or 0 according to whether a cycle is or is
not completed at the $k$th step in our build-up. Thus in the last example
$X_3 = 1$, $X_7 = 1$, $X_8 = 1$ while all other $X_k$'s equal 0. Clearly $X_1 = 1$
if, and only if, $a_1$ moves to the first place. If $a_1$ remains at the first
place, then $X_2 = 1$ if, and only if, $a_2$ remains at the second place. In
general, at the $k$th step we have $n - k + 1$ choices of which one and
only one leads to the completion of a cycle. It follows that $X_k$
equals 1 with probability$^{10}$ $1/(n - k + 1)$ and 0 with probability
$(n - k)/(n - k + 1)$. Moreover, the $X_k$ are mutually independent
and uniformly bounded. In our case


(6.4)    $\mu_k = \frac{1}{n - k + 1}, \qquad \sigma_k^2 = \frac{n - k}{(n - k + 1)^2},$

and hence

(6.5)    $m_n = 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n} \sim \log n$

and

(6.6)    $s_n^2 = \sum_{k=1}^{n} \frac{n - k}{(n - k + 1)^2} \sim \log n.$

$S_n = X_1 + \cdots + X_n$ is the total number of cycles. The average number
of cycles is $m_n$; asymptotically, the number of permutations for which the
number of cycles lies between $\log n + \alpha(\log n)^{1/2}$ and $\log n + \beta(\log n)^{1/2}$ is
given by $n!\{\Phi(\beta) - \Phi(\alpha)\}$. The refined forms of the central limit
theorem give more precise estimates.$^{11}$
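The mean (6.5) and variance (6.6) of the number of cycles can likewise be confirmed by enumeration for small $n$. In this sketch a permutation is stored as a list `places`, where `places[i]` is the (0-based) place to which element $i$ has been moved; this representation and the function names are assumptions made for the illustration.

```python
from fractions import Fraction
from itertools import permutations


def cycle_count(places):
    """Number of cycles when element i is moved to place places[i] (0-based)."""
    seen = [False] * len(places)
    cycles = 0
    for start in range(len(places)):
        if not seen[start]:
            cycles += 1          # a new cycle starts at the first unused element
            j = start
            while not seen[j]:   # follow the arrows until the cycle closes
                seen[j] = True
                j = places[j]
    return cycles


def cycle_moments(n):
    """Exact mean and variance of the number of cycles over all n! permutations."""
    counts = [cycle_count(p) for p in permutations(range(n))]
    mean = Fraction(sum(counts), len(counts))
    second = Fraction(sum(c * c for c in counts), len(counts))
    return mean, second - mean ** 2


if __name__ == "__main__":
    # Arrows 1->3->4->1; 2->5->6->8->2; 7->7 from the text, written 0-based.
    print(cycle_count([2, 4, 3, 0, 5, 7, 6, 1]))
    for n in range(1, 8):
        print(n, cycle_moments(n))
```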


10 Formally, the distribution of $X_k$ depends not only on $k$ but also on $n$. It
suffices to reorder the $X_k$, starting from $k = n$ down to $k = 1$, to have the
distribution depend only on the subscript.

11 A great variety of asymptotic estimates in combinatorial analysis were derived
by other methods by V. Gončarov, Du domaine d'analyse combinatoire, Bulletin
de l'Académie des Sciences URSS, Sér. Math. (in Russian, French summary), vol. 8
(1944), pp. 3-48. The present method is simpler but more restricted in scope; cf.
W. Feller, The Fundamental Limit Theorems in Probability, Bulletin of the
American Mathematical Society, vol. 51 (1945), pp. 800-832.





* 7. The Strong Law of Large Numbers 

The (weak) law of large numbers (5.4) asserts that for every
particular sufficiently large $n$ the deviation $|S_n - m_n|$ is likely to be
small in comparison to $n$. It has been pointed out in connection with
Bernoulli trials (Chapter 8) that this does not imply that $|S_n - m_n|/n$
remains small for all large $n$; it can happen that the law of large
numbers applies but that $|S_n - m_n|/n$ continues to fluctuate between
finite or infinite limits. The law of large numbers permits only the
conclusion that large values of $|S_n - m_n|/n$ occur at infrequent
moments.

We say that the sequence $X_k$ obeys the strong law of large numbers if to
every pair $\epsilon > 0$, $\delta > 0$, there corresponds an $N$ such that there is
probability $1 - \delta$ or better that for every $r > 0$ all $r + 1$ inequalities

(7.1)    $\frac{|S_n - m_n|}{n} < \epsilon, \qquad n = N, N + 1, \ldots, N + r$

will be satisfied.

We can interpret (7.1) roughly by saying that with an overwhelming
probability $|S_n - m_n|/n$ remains small$^{12}$ for all $n > N$.

The Kolmogorov Criterion. The convergence of the series

(7.2)    $\sum \frac{\sigma_k^2}{k^2}$

is a sufficient condition for the strong law of large numbers to apply to
the sequence of mutually independent random variables $X_k$.

Proof. Let $A_\nu$ be the event that for at least one $n$ with $2^{\nu-1} < n \le 2^{\nu}$
the inequality (7.1) does not hold. Obviously it suffices to prove that
for all $\nu$ sufficiently large ($\nu > \log N$) and all $r$

$\Pr\{A_\nu\} + \Pr\{A_{\nu+1}\} + \cdots + \Pr\{A_{\nu+r}\} < \delta.$

In other words, we have to prove the convergence of the series $\sum \Pr\{A_\nu\}$.
Now the event $A_\nu$ implies that for some $n$ with $2^{\nu-1} < n \le 2^{\nu}$

(7.3)    $|S_n - m_n| > \epsilon \cdot 2^{\nu-1}$

* Starred sections treat special topics and may be omitted at first reading. 

12 The general theory introduces a sample space corresponding to the infinite
sequence $\{X_k\}$. The strong law then states that with probability one $|S_n - m_n|/n$
tends to zero. In real variable terminology the strong law asserts convergence
almost everywhere, while the weak law is equivalent to convergence in measure.





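and by Kolmogorov's inequality (Chapter 9, section 7), we have
therefore

(7.4)    $\Pr\{A_\nu\} \le 4\epsilon^{-2} \cdot s_{2^\nu}^2 \cdot 2^{-2\nu}.$

Hence

(7.5)    $\sum_{\nu=1}^{\infty} \Pr\{A_\nu\} \le 4\epsilon^{-2} \sum_{\nu=1}^{\infty} 2^{-2\nu} \sum_{k=1}^{2^\nu} \sigma_k^2 = 4\epsilon^{-2} \sum_{k=1}^{\infty} \sigma_k^2 \sum_{2^\nu \ge k} 2^{-2\nu} \le 8\epsilon^{-2} \sum_{k=1}^{\infty} \frac{\sigma_k^2}{k^2},$

which accomplishes the proof.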

As an application of Kolmogorov's criterion we shall prove the

Theorem. If the sequence of mutually independent random variables
$X_k$ has a common distribution $\{f(x_j)\}$ and if $\mu = E(X_k)$ exists, then the
strong law of large numbers applies.

This theorem is, of course, stronger than the weak law of section 1.
The two theorems are treated independently because of the
methodological interest of the proofs. For a converse cf. problem 11.

Proof. We again use the method of truncation. Two new sequences
of random variables are introduced by

(7.6)    $U_k = X_k, \quad V_k = 0$ if $|X_k| < k$,
         $U_k = 0, \quad V_k = X_k$ if $|X_k| \ge k$.

Then the $U_k$ are mutually independent, and we shall show that they
satisfy Kolmogorov's criterion. Writing $\sigma_k^2$ for the variance of $U_k$, we have

(7.7)    $\sigma_k^2 \le E(U_k^2) = \sum_{|x_j| < k} x_j^2 f(x_j).$

Now for abbreviation put

(7.8)    $a_\nu = \sum_{\nu - 1 \le |x_j| < \nu} |x_j| f(x_j).$

Then the series $\sum a_\nu$ converges since $E(X_k)$ exists. Moreover, from (7.7)

(7.9)    $\sigma_k^2 \le a_1 + 2a_2 + 3a_3 + \cdots + k a_k$

and hence

(7.10)    $\sum_{k=1}^{\infty} \frac{\sigma_k^2}{k^2} \le \sum_{k=1}^{\infty} \frac{1}{k^2} \sum_{\nu=1}^{k} \nu a_\nu = \sum_{\nu=1}^{\infty} \nu a_\nu \sum_{k=\nu}^{\infty} \frac{1}{k^2} < 2 \sum_{\nu=1}^{\infty} a_\nu < \infty.$


For the expectation we have

(7.11)    $E(U_k) = \mu_k = \sum_{|x_j| < k} x_j f(x_j)$

so that $\mu_k \to \mu$ and hence $(\mu_1 + \mu_2 + \cdots + \mu_n)/n \to \mu$. Applying
the strong law of large numbers to $\{U_k\}$, we find therefore that with
probability $1 - \delta$ or better

(7.12)    $\left| \frac{1}{n} \sum_{k=1}^{n} U_k - \mu \right| < \epsilon$

for all $n > N$. It suffices now to prove that the $V_n$ can be neglected,
that is, that the probability of one or more $V_n$ with $n > N$ being
different from zero tends to 0 with $N \to \infty$. It is easily seen that the
first Borel-Cantelli lemma (Chapter 8, section 3) applies with obvious
verbal changes, and that it suffices to prove that $\sum \Pr\{V_n \ne 0\}$
converges. Now

(7.13)    $\Pr\{V_n \ne 0\} = \sum_{|x_j| \ge n} f(x_j) \le \frac{a_{n+1}}{n} + \frac{a_{n+2}}{n+1} + \frac{a_{n+3}}{n+2} + \cdots$

and hence

(7.14)    $\sum_{n=1}^{\infty} \Pr\{V_n \ne 0\} \le \sum_{n=1}^{\infty} \sum_{\nu=n}^{\infty} \frac{a_{\nu+1}}{\nu} = \sum_{\nu=1}^{\infty} \frac{a_{\nu+1}}{\nu} \sum_{n=1}^{\nu} 1 = \sum_{\nu=1}^{\infty} a_{\nu+1} < \infty,$

as asserted.
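The statement (7.1) concerns the whole stretch $n = N, \ldots, N + r$ of a single sequence of trials, not one fixed $n$. A small simulation can make this visible. The sketch below uses throws of a fair die (so that $\mu = 3.5$); the choice of the die, the seeds, and the function name `max_deviation` are assumptions made only for illustration, and a simulation of course proves nothing.

```python
import random


def max_deviation(n_start, n_extra, mean=3.5, seed=0):
    """Largest |S_n / n - mean| over n = n_start, ..., n_start + n_extra along one simulated path."""
    rng = random.Random(seed)        # fixed seed so that the run is reproducible
    total = 0
    worst = 0.0
    for n in range(1, n_start + n_extra + 1):
        total += rng.randint(1, 6)   # one throw of a fair die
        if n >= n_start:
            worst = max(worst, abs(total / n - mean))
    return worst


if __name__ == "__main__":
    for n_start in (100, 1000, 10000):
        print(n_start, max_deviation(n_start, 40000))
```

Along a typical path the largest deviation over the whole stretch shrinks as the starting index $N$ grows, which is the behavior the strong law describes.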


8. Problems for Solution

1. Prove that the law of large numbers applies in example (5.a) also when
$\lambda < 0$. The central limit theorem holds if $\lambda > -1/2$.

2. Decide whether the law of large numbers and the central limit theorem hold
for the sequences of mutually independent variables $X_k$ with distributions defined
as follows ($k \ge 1$):

(a) $\Pr\{X_k = \pm 2^k\} = 1/2$;

(b) $\Pr\{X_k = \pm 2^k\} = 2^{-(2k+1)}$, $\Pr\{X_k = 0\} = 1 - 2^{-2k}$;

(c) Pr\X k = ±fc) = Pr\X k = 0) = 1 - 

3. Ljapunov's condition (1901). Suppose that for some fixed $\delta > 0$ one has
$E(|X_k|^{2+\delta}) = \lambda_k$ and that $\lambda_k/\sigma_k^2$ is uniformly bounded. Show that Lindeberg's
condition is satisfied.

The following six problems treat the weak law of large numbers for dependent 
variables . 

4. Let the $\{X_k\}$ be mutually independent and have a common distribution
with mean $\mu$ and finite variance. If $S_n = X_1 + \cdots + X_n$, prove that the law of
large numbers does not hold for the sequence $\{S_n\}$, but holds for $a_n S_n$ if $a_n \to 0$.





5. Let $\{X_k\}$ be a sequence of random variables such that $X_k$ may depend on
$X_{k-1}$ and $X_{k+1}$ but is independent of all other $X_j$. Show that the law of large
numbers holds, provided the $X_k$ have finite variances.

6. If the joint distribution of $(X_1, \ldots, X_n)$ is defined for every $n$, if the variances
are bounded, and all covariances are negative, the law of large numbers applies.

7. Continuation. Replace the condition $\mathrm{Cov}(X_j, X_k) < 0$ by the assumption
that $\mathrm{Cov}(X_j, X_k) \to 0$ uniformly as $|j - k| \to \infty$. Prove that the law of large
numbers holds.

8. If $|S_n| < cn$ and $\mathrm{Var}(S_n) > \alpha n^2$, then the law of large numbers does not
apply to $\{X_k\}$.

9. In the Polya urn scheme [example (2.c) of Chapter 5] let $X_k$ equal 1 or 0
according to whether the $k$th ball drawn is black or red. Then $S_n$ is the number of
black balls in $n$ drawings. Prove that the law of large numbers does not apply to
$\{X_k\}$. (Hint: Use problem 8 and Chapter 9, problem 13.)

10. Let $\{X_n\}$ be a sequence of mutually independent random variables with a
common distribution. Suppose that the $X_n$ do not have a finite expectation and let
$A$ be a positive constant. The probability is one that infinitely many among the
events $|X_n| > An$ occur.

11. Converse to the strong law of large numbers. Under the assumption of
problem 10 there is probability one that $|S_n| > An$ for infinitely many $n$.

12. Example of an unfavorable "fair" game. Let the possible values of the gain
at each trial be $0, 2, 2^2, 2^3, \ldots$; the probability of the gain being $2^k$ is

(8.1)    $p_k = \frac{1}{2^k k(k+1)},$

and the probability of 0 is $p_0 = 1 - (p_1 + p_2 + \cdots)$. The expected gain is

(8.2)    $\mu = \sum 2^k p_k = (1 - 1/2) + (1/2 - 1/3) + (1/3 - 1/4) + \cdots = 1.$

Assume that at each trial the player pays a unit amount as entrance fee, so that
after $n$ trials his net gain (or loss) is $S_n - n$, where $S_n$ is the sum of $n$ independent
random variables with the above distribution. Show that for every $\epsilon > 0$ the
probability approaches unity that in $n$ trials the player will have sustained a loss greater
than $(1 - \epsilon)n/\mathrm{Log}_2 n$, where $\mathrm{Log}_2 n$ denotes the logarithm to the base 2. In
symbols, prove that

(8.3)    $\Pr\left\{ S_n - n < -\frac{(1 - \epsilon)n}{\mathrm{Log}_2 n} \right\} \to 1.$

Hint: Use the truncation method of section 4, but replace the bound $n\,\mathrm{Log}\,n$
of (4.2) by $n/\mathrm{Log}_2 n$. Show that the probability that $U_k = X_k$ for all $k \le n$ tends
to 1 and prove that

(8.4)    $\Pr\left\{ |U_1 + \cdots + U_n - nE(U_1)| < \frac{\epsilon n}{\mathrm{Log}_2 n} \right\} \to 1,$

(8.5)    $1 - \frac{1}{\mathrm{Log}_2 n} \ge E(U_1) \ge 1 - \frac{1 + \epsilon}{\mathrm{Log}_2 n}.$

For details cf. the paper cited in section 2.





13. A converse to Kolmogorov's criterion. If $\sum \sigma_k^2/k^2$ diverges, then there exists a
sequence $\{X_k\}$ of mutually independent random variables with $\mathrm{Var}(X_k) = \sigma_k^2$ for
which the strong law of large numbers does not apply. [Hint: Prove first that the
convergence of $\sum \Pr\{|X_n| > \epsilon n\}$ is a necessary condition for the strong law to
apply.]

14. Let $\{X_n\}$ be a sequence of mutually independent random variables such that
$X_n = \pm 1$ with probability $(1 - 2^{-n})/2$ and $X_n = \pm 2^n$ with probability $2^{-n-1}$.
Prove that both the weak and the strong law of large numbers apply to $\{X_k\}$.
[Note. This shows that the condition (5.7) is not necessary.]
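As a companion to problem 12, the telescoping sum (8.2) can be checked with exact fractions. This sketch only verifies the arithmetic of (8.1) and (8.2) and does not address the limit statement (8.3); the function names are illustrative.

```python
from fractions import Fraction


def p(k):
    """Probability (8.1) that the gain equals 2**k, for k >= 1."""
    return Fraction(1, 2 ** k * k * (k + 1))


def partial_expected_gain(terms):
    """Partial sum of 2**k * p(k) for k = 1, ..., terms; it telescopes to 1 - 1/(terms + 1)."""
    return sum(2 ** k * p(k) for k in range(1, terms + 1))


def probability_of_zero_gain(terms=60):
    """Approximation to p_0 = 1 - (p_1 + p_2 + ...), using the first `terms` probabilities."""
    return 1 - sum(p(k) for k in range(1, terms + 1))


if __name__ == "__main__":
    for terms in (1, 10, 100):
        print(terms, partial_expected_gain(terms))
    print(float(probability_of_zero_gain()))
```

The partial sums approach 1 only slowly, which already hints at why the game, although "fair" in expectation, behaves unfavorably over any realistic number of trials.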



CHAPTER 11 


INTEGRAL VALUED VARIABLES 
GENERATING FUNCTIONS 


1. Generalities 

Among discrete random variables those assuming only the integral
values $k = 0, 1, 2, \ldots$ are of special importance. In particular, all
recurrence and waiting times are of this nature. The study of such
variables is facilitated by the powerful method of generating functions
which will later be recognized as a special case of the method of
characteristic functions on which the theory of probability depends to a
large extent. More generally, the subject of generating functions
belongs to the domain of operational methods which are widely used
in the theory of differential and integral equations. In the theory of
probability generating functions have been used since DeMoivre and
Laplace, but the power and the possibilities of the method are rarely
fully utilized.


Definition. Let $a_0, a_1, a_2, \ldots$ be a sequence of real numbers. If

(1.1)    $A(s) = a_0 + a_1 s + a_2 s^2 + \cdots$

converges in some interval $-s_0 < s < s_0$, then the function $A(s)$ is called
the generating function of the sequence $\{a_j\}$.

The variable $s$ itself has no significance. If the sequence $\{a_j\}$ is
bounded, then a comparison with the geometric series shows that (1.1)
converges at least for $|s| < 1$.

Examples. If $a_j = 1$ for all $j$, then $A(s) = 1/(1 - s)$. The generating
function of the sequence $(0, 0, 1, 1, 1, \ldots)$ is $s^2/(1 - s)$. The
sequence $a_j = 1/j!$ has the generating function $e^s$. For fixed $n$ the
sequence $a_j = \binom{n}{j}$ has the generating function $(1 + s)^n$. If $X$ is the
number scored with a throw of a perfect die, the probability
distribution of $X$ has the generating function
$(s + s^2 + s^3 + s^4 + s^5 + s^6)/6 = s(1 - s^6)/6(1 - s)$.




Consider now an integral-valued random variable with the
probability distribution

(1.2)    $\Pr\{X = j\} = p_j, \qquad j = 0, 1, \ldots.$

It will be convenient to have a special notation for its "tails," and we
shall write

(1.3)    $\Pr\{X > j\} = q_j$

so that

(1.4)    $q_k = p_{k+1} + p_{k+2} + \cdots, \qquad k \ge 0.$

We introduce the two generating functions

(1.5)    $P(s) = p_0 + p_1 s + p_2 s^2 + p_3 s^3 + \cdots$

and

(1.6)    $Q(s) = q_0 + q_1 s + q_2 s^2 + q_3 s^3 + \cdots.$

The two series converge at least for $|s| < 1$ since their coefficients are
bounded. Moreover, (1.2) represents a probability distribution so
that $P(1) = 1$ and hence $P(s)$ converges absolutely at least for
$-1 \le s \le 1$.


Theorem 1. For $-1 < s < 1$ we have

(1.7)    $Q(s) = \frac{1 - P(s)}{1 - s}.$

Proof. From (1.4) we conclude that $q_k = 1 - p_0 - p_1 - \cdots - p_k$.
Introduce these expressions into (1.6) and collect like terms. The
positive terms add to $1 + s + s^2 + \cdots = 1/(1 - s)$. The term $-p_j$
appears as a summand of $q_j, q_{j+1}, q_{j+2}, \ldots$ so that in (1.6) $-p_j$ appears
as factor of $s^j + s^{j+1} + s^{j+2} + \cdots = s^j/(1 - s)$. Hence
$Q(s) = (1 - s)^{-1}(1 - p_0 - p_1 s - p_2 s^2 - \cdots) = (1 - s)^{-1}\{1 - P(s)\}$, as
asserted in (1.7).
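For a distribution with finitely many values, theorem 1 is a polynomial identity, $(1 - s)Q(s) = 1 - P(s)$, and can be checked coefficient by coefficient. The sketch below does this for the perfect die; the helper names `tails` and `one_minus_s_times` are illustrative assumptions.

```python
from fractions import Fraction


def tails(p):
    """q_k = Pr{X > k} = p_{k+1} + p_{k+2} + ... for a finite distribution p_0, ..., p_m."""
    return [sum(p[k + 1:], Fraction(0)) for k in range(len(p))]


def one_minus_s_times(q):
    """Coefficients of (1 - s) * Q(s), where Q has the coefficient list q."""
    shifted = [Fraction(0)] + list(q)      # coefficients of s * Q(s)
    padded = list(q) + [Fraction(0)]       # coefficients of Q(s), padded to equal length
    return [a - b for a, b in zip(padded, shifted)]


def satisfies_theorem_1(p):
    """Check (1 - s) Q(s) == 1 - P(s) coefficientwise."""
    left = one_minus_s_times(tails(p))
    right = [1 - p[0]] + [-x for x in p[1:]] + [Fraction(0)]
    return left == right


if __name__ == "__main__":
    die = [Fraction(0)] + [Fraction(1, 6)] * 6   # p_0 = 0, p_1 = ... = p_6 = 1/6
    print(tails(die))
    print(satisfies_theorem_1(die))
```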

Theorem 2. The expectation $E(X)$ can be calculated either from the
probability distribution (1.2) or in terms of the "tails" (1.3). Thus

(1.8)    $E(X) = \sum_{j=1}^{\infty} j p_j = \sum_{k=0}^{\infty} q_k.$

In terms of the generating functions

(1.9)    $E(X) = P'(1) = Q(1).$





It is understood that the two series in (1.8) may diverge. Since all
terms of the series are positive, we may without danger speak of an
infinite expectation. In this case (1.9) is to be replaced by the statement
that $P'(s)$ and $Q(s)$ tend to $\infty$ as $s \to 1$.

Proof. If $E(X)$ is finite, then the relation $E(X) = P'(1)$ follows
from the definition. That $Q(1) = P'(1)$ follows by differentiating the
identity $1 - P(s) = (1 - s)Q(s)$ of theorem 1. If $E(X)$ is infinite,
then $P'(s)$ and $Q(s)$ increase over all bounds as $s \to 1$. The equations
(1.8) are, of course, the same as (1.9). The fact that $\sum q_k = \sum j p_j$ can
be proved independently by the argument used to prove theorem 1.

Theorem 3. If $E(X^2) = \sum n^2 p_n$ is finite, then

(1.10)    $E(X^2) = P''(1) + P'(1) = 2Q'(1) + Q(1)$

and hence

(1.11)    $\mathrm{Var}(X) = P''(1) + P'(1) - P'^2(1) = 2Q'(1) + Q(1) - Q^2(1).$

The variance is infinite if and only if $P''(s) \to \infty$ as $s \to 1$.

Proof. The first expression in (1.10) follows from

(1.12)    $E(X^2) = \sum k^2 p_k = \sum k(k - 1) p_k + \sum k p_k.$

That the two representations in (1.10) are equivalent is seen upon
differentiating the identity $1 - P(s) = (1 - s)Q(s)$ twice.
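Formulas (1.9) and (1.11) can be tried out on the perfect die, where both $P(s)$ and $Q(s)$ are polynomials and the derivatives at $s = 1$ are finite sums. The following sketch represents a polynomial by its coefficient list; the helper names are assumptions made for the illustration.

```python
from fractions import Fraction


def derivative(coeffs):
    """Coefficient list of the derivative of a polynomial given by its coefficient list."""
    return [k * c for k, c in enumerate(coeffs)][1:]


def value_at_one(coeffs):
    """Value of the polynomial at s = 1, i.e. the sum of its coefficients."""
    return sum(coeffs, Fraction(0))


def mean_and_variance_from_p(p):
    """E(X) = P'(1) and Var(X) = P''(1) + P'(1) - P'(1)^2, as in (1.9) and (1.11)."""
    d1 = value_at_one(derivative(p))
    d2 = value_at_one(derivative(derivative(p)))
    return d1, d2 + d1 - d1 ** 2


def mean_and_variance_from_q(q):
    """E(X) = Q(1) and Var(X) = 2 Q'(1) + Q(1) - Q(1)^2, as in (1.9) and (1.11)."""
    q1 = value_at_one(q)
    return q1, 2 * value_at_one(derivative(q)) + q1 - q1 ** 2


if __name__ == "__main__":
    die = [Fraction(0)] + [Fraction(1, 6)] * 6
    die_tails = [sum(die[k + 1:], Fraction(0)) for k in range(len(die))]
    print(mean_and_variance_from_p(die))
    print(mean_and_variance_from_q(die_tails))
```

Both routes should give the same pair of values, as the theorems assert.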


2. Convolutions 

Let $X$ and $Y$ be non-negative independent integral-valued random
variables with the probability distributions $\Pr\{X = j\} = a_j$ and
$\Pr\{Y = j\} = b_j$. The event $(X = j, Y = k)$ has probability $a_j b_k$.
The sum $S = X + Y$ is a new random variable, and the event $S = r$
is the union of the events

$(X = 0, Y = r), \; (X = 1, Y = r - 1), \; (X = 2, Y = r - 2), \; \ldots, \; (X = r, Y = 0).$

These events are mutually exclusive, and therefore the distribution
$c_r = \Pr\{S = r\}$ is given by

(2.1)    $c_r = a_0 b_r + a_1 b_{r-1} + a_2 b_{r-2} + \cdots + a_{r-1} b_1 + a_r b_0.$

The operation (2.1), leading from the two sequences $\{a_k\}$ and $\{b_k\}$
to a new sequence $\{c_k\}$, occurs so frequently that it is convenient to
introduce a special name and notation for it.





Definition. Let $\{a_k\}$ and $\{b_k\}$ be any two number sequences (not
necessarily probability distributions). The new sequence $\{c_r\}$ defined by
(2.1) is called the convolution$^1$ of $\{a_k\}$ and $\{b_k\}$ and will be denoted by

(2.2)    $\{c_k\} = \{a_k\} * \{b_k\}.$

Examples. (a) If $a_k = b_k = 1$ for all $k \ge 0$, then $c_k = k + 1$. If
$a_k = k$, $b_k = 1$, then $c_k = 1 + 2 + \cdots + k = k(k + 1)/2$. Finally, if
$a_0 = a_1 = 1/2$, $a_k = 0$ for $k \ge 2$, then $c_k = (b_k + b_{k-1})/2$, etc.


Let now the sequences $\{a_k\}$ and $\{b_k\}$ have generating functions
$A(s) = \sum a_k s^k$ and $B(s) = \sum b_k s^k$. The product $A(s)B(s)$ can be
obtained by termwise multiplication of the power series for $A(s)$ and $B(s)$.
Collecting terms with equal powers of $s$, we find that the coefficient $c_r$
of $s^r$ in the expansion of $A(s)B(s)$ is given by (2.1). We have thus the

Theorem. If $\{a_k\}$ and $\{b_k\}$ are sequences with generating functions
$A(s)$ and $B(s)$, and $\{c_k\}$ is their convolution, then the generating
function $C(s) = \sum c_k s^k$ is the product

(2.3)    $C(s) = A(s)B(s).$

If $X$ and $Y$ are non-negative integral-valued mutually independent random
variables with generating functions $A(s)$ and $B(s)$, then their sum $X + Y$
has the generating function $A(s)B(s)$.
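The convolution (2.1) and the product rule (2.3) are easy to try out on finite sequences. The sketch below convolves the distribution of a perfect die with itself and compares generating functions at one arbitrary point; the names `convolve` and `evaluate` are illustrative, and finite lists stand in for the infinite sequences of the text.

```python
from fractions import Fraction


def convolve(a, b):
    """Convolution (2.1) of two finite sequences: c_r is the sum of a_j * b_(r - j)."""
    c = [Fraction(0)] * (len(a) + len(b) - 1)
    for j, a_j in enumerate(a):
        for k, b_k in enumerate(b):
            c[j + k] += a_j * b_k
    return c


def evaluate(coeffs, s):
    """Value of the generating function with the given coefficients at the point s."""
    return sum(c * s ** k for k, c in enumerate(coeffs))


if __name__ == "__main__":
    die = [Fraction(0)] + [Fraction(1, 6)] * 6
    two_dice = convolve(die, die)            # distribution of the sum of two throws
    print(two_dice[7])                        # probability that the sum equals 7
    s = Fraction(1, 2)
    print(evaluate(two_dice, s) == evaluate(die, s) ** 2)
```

Because $c_r$ involves only $a_0, \ldots, a_r$ and $b_0, \ldots, b_r$, truncating two infinite sequences after $N$ terms still gives the first $N$ terms of their convolution exactly.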

Let now $\{a_k\}, \{b_k\}, \{c_k\}, \{d_k\}, \ldots$ be any sequences. We can form
the convolution $\{a_k\} * \{b_k\}$, and then the convolution of this new
sequence with $\{c_k\}$, etc. The generating function of $\{a_k\} * \{b_k\} * \{c_k\} * \{d_k\}$
is $A(s)B(s)C(s)D(s)$, and this fact shows that the order in which the
convolutions are performed is immaterial. For example,
$\{a_k\} * \{b_k\} * \{c_k\} = \{c_k\} * \{b_k\} * \{a_k\}$, etc. Thus the convolution is an
associative and commutative operation (exactly as the summation of
random variables).

In the study of sums of independent random variables $X_n$ the special
case where the $X_n$ have a common distribution is of particular interest.
If $\{a_j\}$ is the common probability distribution of the $X_n$, then the
distribution of $S_n = X_1 + \cdots + X_n$ will be denoted by $\{a_j\}^{n*}$. Thus

(2.4)    $\{a_j\}^{2*} = \{a_j\} * \{a_j\}, \qquad \{a_j\}^{3*} = \{a_j\}^{2*} * \{a_j\}, \; \ldots$

and generally

(2.5)    $\{a_j\}^{n*} = \{a_j\}^{(n-1)*} * \{a_j\}.$

1 Some American and British writers prefer the word faltung, which is of German
origin. The French equivalent is composition.





In words, $\{a_j\}^{n*}$ is the sequence of numbers whose generating function is
$A^n(s)$. In particular, $\{a_j\}^{1*}$ is the same as $\{a_j\}$, and $\{a_j\}^{0*}$ is the
sequence whose generating function is $A^0(s) = 1$, that is, the sequence
$(1, 0, 0, 0, \ldots)$.

Examples. (b) Binomial Distribution. The generating function of
the binomial distribution $b(k; n, p) = \binom{n}{k} p^k q^{n-k}$ is

(2.6)    $\sum_{k=0}^{n} \binom{n}{k} (ps)^k q^{n-k} = (q + ps)^n.$

The fact that the generating function is the $n$th power of $q + ps$ shows
that $\{b(k; n, p)\}$ is the distribution of a sum $S_n = X_1 + \cdots + X_n$ of $n$
independent random variables with the common generating function
$q + ps$; each variable $X_j$ assumes the value 0 with probability $q$ and the
value 1 with probability $p$. Thus

(2.7)    $\{b(k; n, p)\} = \{b(k; 1, p)\}^{n*}.$

We have used the representation $S_n = X_1 + \cdots + X_n$ several times
[cf. Chapter 9, examples (3.a) and (5.a)]. The above argument can
be reversed and used for a new derivation of the binomial distribution.
The multiplicative property $(q + ps)^m (q + ps)^n = (q + ps)^{m+n}$ shows
also that

(2.8)    $\{b(k; m, p)\} * \{b(k; n, p)\} = \{b(k; m + n, p)\}.$

We have thus a new derivation of the result of Chapter 6, problem 22.
Also the formulas $E(S_n) = np$ and $\mathrm{Var}(S_n) = npq$ can be derived in a
simple way by differentiation of the generating function $(q + ps)^n$ in
accordance with (1.9) and (1.11).
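Relations (2.7) and (2.8) can be confirmed with exact arithmetic by building the $n$-fold convolution step by step as in (2.5). This is a sketch under the assumption that finite coefficient lists represent the distributions; the names `convolve`, `convolution_power`, and `binomial` are illustrative.

```python
from fractions import Fraction
from math import comb


def convolve(a, b):
    """Convolution (2.1) of two finite sequences."""
    c = [Fraction(0)] * (len(a) + len(b) - 1)
    for j, a_j in enumerate(a):
        for k, b_k in enumerate(b):
            c[j + k] += a_j * b_k
    return c


def convolution_power(a, n):
    """n-fold convolution {a}^(n*), starting from {a}^(0*) = (1, 0, 0, ...)."""
    result = [Fraction(1)]
    for _ in range(n):
        result = convolve(result, a)
    return result


def binomial(n, p):
    """The binomial distribution b(k; n, p) for k = 0, ..., n."""
    q = 1 - p
    return [comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1)]


if __name__ == "__main__":
    p = Fraction(1, 3)
    bernoulli = [1 - p, p]   # b(k; 1, p): value 0 with probability q, value 1 with probability p
    print(convolution_power(bernoulli, 4) == binomial(4, p))
    print(convolve(binomial(3, p), binomial(5, p)) == binomial(8, p))
```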

(c) Poisson Distribution. The generating function of the
distribution $p(k; \lambda) = e^{-\lambda} \lambda^k / k!$ is

(2.9)    $\sum_{k=0}^{\infty} e^{-\lambda} \frac{(\lambda s)^k}{k!} = e^{-\lambda + \lambda s}.$

It follows that

(2.10)    $\{p(k; \lambda)\} * \{p(k; \mu)\} = \{p(k; \lambda + \mu)\},$

which can also be proved directly (cf. Chapter 6, problem 23). By
differentiation we find that both mean and variance of the Poisson
distribution are $\lambda$; this result has been proved previously by direct
computation [cf. Chapter 9, example (4.c)].




3. The Geometric and the Pascal Distributions 

We say that the random variable $X$ has a geometric distribution if

(3.1)    $\Pr\{X = k\} = q^k p, \qquad k = 0, 1, 2, \ldots$

where $p$ and $q$ are positive constants with $p + q = 1$. The corresponding
generating function is

(3.2)    $p \sum_{k=0}^{\infty} (qs)^k = \frac{p}{1 - qs}.$

Using theorems 2 and 3 of section 1, it is easily found that

(3.3)    $E(X) = \frac{q}{p}, \qquad \mathrm{Var}(X) = \frac{q}{p^2}.$

To fix ideas consider a sequence of $n$ Bernoulli trials with probability
$p$ for success. The probability that at least one success occurs and that
the first success is preceded by exactly $k$ failures ($k \le n - 1$) is $q^k p$.
A passage to the limit leads us to interpret $X$ as the number of failures
preceding the first success when the trials are continued as long as
necessary for a success to occur, but this interpretation refers to an
infinite sample space. The advantage of the formal definition (3.1)
is that we need not worry about the structure of the original sample
space.

We can in a similar way study the number of trials preceding the
$r$th success. Let $X_k$ be the number of failures following the $(k - 1)$th
and preceding the $k$th success. Then $X_1 + X_2 + \cdots + X_r + r$ is the
number of trials up to and including the $r$th success. Strictly speaking,
these variables are defined only in the sample space corresponding to
unending sequences of Bernoulli trials studied in Chapter 8. However,
we can study the distribution directly without any reference to a
particular sample space. The notion of Bernoulli trials implies that the
variables $X_1, \ldots, X_r$ are mutually independent and that they have the
common distribution (3.1). The sum

(3.4)    $S_r = X_1 + \cdots + X_r$

can be interpreted as the number of failures preceding the $r$th success.
Then

(3.5)    $\Pr\{S_r = k\} = f(k; r, p)$

is the probability of the $r$th success occurring at the trial number $r + k$.
This means that among the first $r + k - 1$ trials there are exactly $k$
failures and the following, or $(r + k)$th, trial results in success; the
corresponding probabilities are $\binom{r+k-1}{k} p^{r-1} q^k$ and $p$, so that

(3.6)    $f(k; r, p) = \binom{r+k-1}{k} p^r q^k.$

The same result follows directly from (3.4). In fact, from (3.2) we
find that the generating function of $S_r$ is

(3.7)    $\left( \frac{p}{1 - qs} \right)^r.$

Using the binomial formula (6.6) of Chapter 2, it is seen that the
coefficient of $s^k$ is

(3.8)    $f(k; r, p) = \binom{-r}{k} p^r (-q)^k, \qquad k = 0, 1, 2, \ldots,$

and this formula agrees with (3.6) (cf. Chapter 2, problem 1 of
section 9).

The distribution $f(k; r, p)$ defined by either (3.6) or (3.8) is called the
Pascal distribution. It depends on the two parameters $r$ and $p$, where $r$
is an integer and $0 < p < 1$ ($q = 1 - p$). Its generating function is
(3.7). The Pascal distribution is the $r$-fold convolution of the geometric
distribution (3.1), or in symbols

(3.9)    $\{f(k; r, p)\} = \{q^k p\}^{r*}.$

Conversely, the geometric distribution (3.1) is the special case $r = 1$
of the general Pascal distribution. It follows that the mean and
variance of the Pascal distribution are $rq/p$ and $rq/p^2$. Moreover,

(3.10)    $\{f(k; r_1, p)\} * \{f(k; r_2, p)\} = \{f(k; r_1 + r_2, p)\}.$

The Pascal distribution remains meaningful if r is not an integer 
provided that r > 0. It occurs in many connections and is often 
called the negative binomial distribution. [An alternative form of the 
Pascal distribution with non-integral r is given by the Polya distribu¬ 
tion ( 8 . 12 ) of Chapter 6 , where \p corresponds to r and 1/(1 + p) to 5 .] 
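
As a numerical check of (3.6), (3.9), and the stated mean and variance, the following Python sketch builds the Pascal distribution by repeated convolution of the geometric distribution and compares it with the closed form. The function names and the truncation length `n_terms` are illustrative assumptions, not part of the text. Truncation is harmless here, since the kth term of a convolution depends only on terms with index at most k.

```python
from math import comb


def geometric(p, n_terms):
    """First n_terms of the geometric distribution {q^k p}, k = 0, 1, ..."""
    q = 1.0 - p
    return [q**k * p for k in range(n_terms)]


def convolve(a, b):
    """Convolution of two sequences, truncated to the shorter length."""
    n = min(len(a), len(b))
    return [sum(a[i] * b[k - i] for i in range(k + 1)) for k in range(n)]


def pascal_by_convolution(r, p, n_terms):
    """r-fold convolution of the geometric distribution, as in (3.9)."""
    dist = geometric(p, n_terms)
    for _ in range(r - 1):
        dist = convolve(dist, geometric(p, n_terms))
    return dist


def pascal_closed_form(r, p, n_terms):
    """Closed form (3.6): C(r+k-1, k) p^r q^k."""
    q = 1.0 - p
    return [comb(r + k - 1, k) * p**r * q**k for k in range(n_terms)]


def mean_and_variance(dist):
    """Mean and variance of a (numerically complete) distribution on 0, 1, ..."""
    mean = sum(k * pk for k, pk in enumerate(dist))
    second = sum(k * k * pk for k, pk in enumerate(dist))
    return mean, second - mean**2


if __name__ == "__main__":
    r, p = 3, 0.4
    conv = pascal_by_convolution(r, p, 60)
    closed = pascal_closed_form(r, p, 60)
    # The two constructions should differ by rounding error only.
    print(max(abs(x - y) for x, y in zip(conv, closed)))
    # Should be close to rq/p and rq/p^2.
    print(mean_and_variance(pascal_closed_form(r, p, 400)))
```

With r = 3 and p = 0.4 the formulas of the text give mean rq/p = 4.5 and variance rq/p^2 = 11.25, which the sketch should reproduce up to rounding.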

4. Relation to Holding or Waiting Times 

If a sequence of Bernoulli trials is performed at the rate of one per 
second, then the number X of failures preceding the first success 
represents a waiting time. Its probability distribution has a curious 
property which puzzles unprepared minds. 





Suppose that in a particular sequence no success occurred in the
first $\nu$ trials, so that $X \ge \nu$. The waiting time from this trial to the
next success is independent of the number of preceding failures. Hence
the probability that the waiting time will be prolonged by an additional
$k$ seconds is independent of $\nu$ and equals the initial probability
of the total length exceeding k seconds. It is obvious that this property 
is not shared by waiting times encountered in phenomena such as 
waiting lines before counters and lifetimes of machines. For example, 
the information that no streetcar has passed for five minutes ordinarily 
increases our expectation that a car will pass during the next minute, 
and a reasonable theory of waiting times must reflect this common- 
sense statement. However, there exist many phenomena where the 
waiting times share the described property of the waiting time for the 
first success. A typical example is provided by a type of conversation 
in telephone booths which has often been cited as an ideal illustration 
of incoherence. If the booth is equipped with a tolerably comfortable 
seat eliminating physical fatigue, then the conversation must be regarded
as typical of a process which depends entirely on momentary
impulses. Whatever happens bears no relation to the past, and the 
termination is an instantaneous chance effect. The probability that 
it will occur within the next minute must be independent of the length 
of the foregoing chatter, so that the length of the conversation is 
analogous to the waiting time for the first success. 

The general theory of stochastic processes (Chapters 15 and 17) will 
reveal that we are here dealing with an important but particular 
case of Markov processes. There is only one possible probability 
distribution for waiting times with the described property; in the case 
of discrete trials it is the geometric distribution, and in the case of a 
continuous time it is the exponential distribution. Here we are limited 
to discrete trials and proceed to show that our property is characteristic 
of the geometric distribution. 

Suppose, then, that a waiting time $X$ can assume the values 0, 1, 2, ...
with probabilities $p_0, p_1, p_2, \ldots$. Let the distribution of $X$ have
the following property: the conditional probability that the waiting time
terminates at the kth trial, assuming that it has not terminated before,
equals $p_0$ (the probability at the first trial). We claim that
$p_k = (1 - p_0)^k p_0$, so that $X$ has a geometric distribution.

For a proof we introduce again the “tails”

$q_k = p_{k+1} + p_{k+2} + p_{k+3} + \cdots = \Pr\{X > k\}$.

Our hypothesis is $X > k - 1$, and its probability is $q_{k-1}$. The
conditional probability of $X = k$ is therefore $p_k/q_{k-1}$, and the
assumption is that for all $k \ge 1$

(4.1) $\frac{p_k}{q_{k-1}} = p_0$.

Now $p_k = q_{k-1} - q_k$, and hence

(4.2) $\frac{q_k}{q_{k-1}} = 1 - p_0$.

Since $q_0 = p_1 + p_2 + \cdots = 1 - p_0$, it follows that $q_k = (1 - p_0)^{k+1}$,
and hence $p_k = q_{k-1} - q_k = (1 - p_0)^k p_0$, as asserted.
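
The argument can be traced numerically. The sketch below (function names are illustrative assumptions) builds the probabilities $p_k$ from the constant conditional probability $p_0$ alone, using $p_k = p_0 q_{k-1}$ and $q_k = q_{k-1} - p_k$ with the convention that the tail before the first trial equals 1, and compares the result with the geometric distribution.

```python
def distribution_from_constant_hazard(p0, n_terms):
    """Build p_0, p_1, ... from the hypothesis (4.1) alone."""
    probs = []
    tail = 1.0  # probability that the waiting time has not yet terminated
    for _ in range(n_terms):
        pk = p0 * tail   # (4.1): p_k = p_0 * q_{k-1}
        probs.append(pk)
        tail -= pk       # q_k = q_{k-1} - p_k
    return probs


def geometric(p0, n_terms):
    """The claimed closed form p_k = (1 - p_0)^k p_0."""
    return [(1.0 - p0) ** k * p0 for k in range(n_terms)]


if __name__ == "__main__":
    built = distribution_from_constant_hazard(0.3, 50)
    closed = geometric(0.3, 50)
    # The two lists should agree up to rounding error.
    print(max(abs(x - y) for x, y in zip(built, closed)))
```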

It will be useful to indicate how a simple passage to the limit leads
from the geometric distribution $\{q^k p\}$ to the exponential distribution.
Suppose that each trial takes a time $\Delta t$, so that $k$ trials take time
$k\,\Delta t$. We shall let $\Delta t \to 0$ and at the same time $k \to \infty$, so
that the total time consumed remains constant, $k\,\Delta t = t$. If $p$ and $q$
were to remain constant, the first success would occur sooner and sooner,
and in the limit we would be sure to have waiting time zero. In a sensible
limiting process we must therefore let $p \to 0$ and $q \to 1$, exactly as was
done in the derivation of the Poisson distribution in Chapter 6. The mean
of the geometric distribution was found to be $q/p$ [cf. (3.3)]. If each
trial takes time $\Delta t$ then the mean of the waiting time is $(q/p)\,\Delta t$,
and the process remains physically meaningful if $(q/p)\,\Delta t$ is a constant.
Since $p \to 0$, $q \to 1$, we can put

(4.3) $p \sim \lambda\,\Delta t$.

Then

(4.4) $q^k p \sim (1 - \lambda\,\Delta t)^{t/\Delta t}\, \lambda\,\Delta t$.

Taking logarithms, the first factor is seen to tend to $e^{-\lambda t}$, and hence

(4.5) $q^k p \sim e^{-\lambda t}\, \lambda\,\Delta t$.

For the probability that the first success occurs after time $t = k\,\Delta t$ we
get in the limit

(4.6) $q^k \sim (1 - \lambda\,\Delta t)^{t/\Delta t} \to e^{-\lambda t}$.

This is the exponential distribution. We have encountered it as the
first term in the Poisson distribution (Chapter 6), where $e^{-\lambda t}$ was
interpreted as the probability of no event within time $t$, which is just
another way of saying that the waiting time exceeds $t$. More generally,
with our passage to the limit $k\,\Delta t = t$, $p \sim \lambda\,\Delta t$, we have from (3.6)



(4.7) $f(k; r, p) \sim e^{-\lambda t}\, \frac{(\lambda t)^{r-1}}{(r-1)!} \cdot \lambda\,\Delta t$.

The factor $e^{-\lambda t} (\lambda t)^{r-1}/(r-1)!$ is the Poisson expression for the
probability of exactly $r - 1$ events within time $t$; the factor $\lambda\,\Delta t$ stands
for the probability of an additional event within the following interval of
length $\Delta t$. We have thus a new argument leading to the Poisson
distribution and relating it to the theory of waiting times (or of addition
of random variables). This theory will be continued in Chapter 17
and, more systematically, in volume 2.
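
The passage to the limit can be checked numerically. The sketch below (function names and the sample values of $\lambda$, $t$, $\Delta t$ are illustrative assumptions) evaluates the discrete tail $q^k$ with $p = \lambda\,\Delta t$ and $k = t/\Delta t$, and the ratio of the exact Pascal probability (3.6) to the right side of (4.7); the first should approach $e^{-\lambda t}$ and the second should approach 1 as $\Delta t$ decreases.

```python
from math import comb, exp, factorial


def discrete_tail(lam, t, dt):
    """q^k with p = lam*dt and k = t/dt trials, cf. (4.6)."""
    k = round(t / dt)
    p = lam * dt
    return (1.0 - p) ** k


def pascal_over_limit(lam, t, dt, r):
    """Ratio of the exact Pascal probability (3.6) to the limiting form (4.7)."""
    k = round(t / dt)
    p = lam * dt
    exact = comb(r + k - 1, k) * p**r * (1.0 - p) ** k
    limit = exp(-lam * t) * (lam * t) ** (r - 1) / factorial(r - 1) * lam * dt
    return exact / limit


if __name__ == "__main__":
    lam, t = 2.0, 1.5
    for dt in (1e-2, 1e-3, 1e-4, 1e-5):
        print(dt, discrete_tail(lam, t, dt), exp(-lam * t),
              pascal_over_limit(lam, t, dt, r=3))
```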

5. Compound Distributions 

Let $\{X_k\}$ be a sequence of mutually independent random variables
with the common distribution $\Pr\{X_k = j\} = f_j$ ($j = 0, 1, 2, \ldots$). We
are often interested in sums $S_N = X_1 + X_2 + \cdots + X_N$, where the
number $N$ of terms is a random variable which is independent of the
$X_j$ and has a given distribution $\Pr\{N = n\} = g_n$. The distribution of
$S_N$ follows from the fundamental formula for conditional probabilities
[Chapter 5, (1.6)]

(5.1) $\Pr\{S_N = j\} = \sum_{n=0}^{\infty} \Pr\{N = n\}\, \Pr\{X_1 + \cdots + X_n = j\}$;

here the distribution of $X_1 + \cdots + X_n$ is given by the n-fold convolution
of $\{f_j\}$ with itself, so that (5.1) can also be written in the form

(5.2) $\Pr\{S_N = j\} = \sum_{n=0}^{\infty} g_n \{f_j\}^{n*}$.

Distributions of this type are called compound distributions.
The most important case is that where $N$ has a Poisson distribution
while the variables $X_k$ can assume only the values 1 and 0 with
probabilities $p$ and $q$. In this case $S_n = X_1 + \cdots + X_n$ has the binomial
distribution $b(j; n, p) = \binom{n}{j} p^j q^{n-j}$, and $g_n = e^{-\lambda} \lambda^n / n!$. Then

(5.3) $\Pr\{S_N = j\} = e^{-\lambda} \sum_{n=0}^{\infty} \frac{\lambda^n}{n!} \binom{n}{j} p^j q^{n-j}
      = e^{-\lambda} \frac{(\lambda p)^j}{j!} \sum_{n=j}^{\infty} \frac{(\lambda q)^{n-j}}{(n-j)!}$.

The last sum equals $e^{\lambda q} = e^{\lambda(1-p)}$, and hence the composition of a
Poisson distribution for $N$ and a binomial distribution for $X_1 + \cdots + X_n$
leads to the compound distribution

(5.4) $\Pr\{S_N = j\} = e^{-\lambda p} \frac{(\lambda p)^j}{j!}$;

that is, a Poisson distribution with parameter $\lambda p$.
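
The result (5.4) is easy to confirm numerically. The following sketch (function names, the cutoff `n_max`, and the sample parameters are illustrative assumptions) sums the series (5.3) directly and compares it with the Poisson probabilities with parameter $\lambda p$.

```python
from math import comb, exp


def poisson_terms(lam, n_max):
    """Poisson probabilities e^{-lam} lam^n / n! for n = 0..n_max, built iteratively."""
    terms = [exp(-lam)]
    for n in range(1, n_max + 1):
        terms.append(terms[-1] * lam / n)
    return terms


def compound_poisson_binomial(lam, p, j, n_max=150):
    """Series (5.3), truncated at n_max (the neglected tail is negligible for moderate lam)."""
    q = 1.0 - p
    g = poisson_terms(lam, n_max)
    return sum(g[n] * comb(n, j) * p**j * q ** (n - j) for n in range(j, n_max + 1))


if __name__ == "__main__":
    lam, p = 3.0, 0.25
    target = poisson_terms(lam * p, 10)  # Poisson with parameter lam*p, as in (5.4)
    for j in range(6):
        print(j, compound_poisson_binomial(lam, p, j), target[j])
```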

Examples. (a) We saw in Chapter 6, example (6.c), that X-rays
produce chromosome breakages in cells: for a given dosage and time
of exposure the number $N$ of breakages in each cell is a random variable
with a Poisson distribution. Each breakage has a fixed probability $q$
of healing; with probability $p = 1 - q$ a breakage does not heal, in
which case the cell dies. The probability that $j$ out of $N$ breakages
in a cell will not heal is given by the binomial distribution $b(j; N, p)$,
and hence the number $S_N$ of observable (unhealed) breakages has the
distribution (5.4).²

(b) Compound distributions of the form (5.2) occur in particular
in connection with composite populations such as encountered in
Chapter 5, section 2. For example, if $g_n$ is the probability of a family
having exactly $n$ children, and if the sex ratio of boys to girls is $p:q$
($p + q = 1$), then the probability of a family having exactly $j$ boys is
the compound binomial distribution

(5.5) $\sum_{n=j}^{\infty} g_n \binom{n}{j} p^j q^{n-j}$.

Again, if an insect has probability $g_n$ of laying exactly $n$ eggs and each
egg has a probability $p$ of survival, then (assuming statistical
independence) (5.5) is the probability that exactly $j$ eggs will survive.

(c) Let $X_j$ represent the damage caused by the $j$th lightning hit.
If $g_n$ is the probability of exactly $n$ strikes, then (5.2) is the probability
distribution of the accumulated damage. Similar accumulated chance
effects occur in insurance, physics, engineering, and elsewhere.

² Cf. D. G. Catcheside, Genetic Effects of Radiations, Advances in Genetics, edited
by M. Demerec, vol. 2, Academic Press, New York, 1948, pp. 271-358, in particular
p. 339.

Formula (5.2) is not pleasing, but it can be simplified by the use of
generating functions. Let $f(s) = \sum f_j s^j$ and $g(s) = \sum g_n s^n$ be the
generating functions of the $X_k$ and $N$, respectively. Then for a fixed $n$
the generating function of the n-fold convolution $\{f_j\}^{n*}$ is $f^n(s)$ (cf.
section 2), and hence the generating function of (5.2) is $\sum g_n f^n(s)$. This
is the function $g(s)$ with $s$ replaced by $f(s)$. Hence we have the

Theorem. The generating function of the compound distribution (5.2)
is the compound function $g(f(s))$.
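
For distributions with finitely many terms the theorem can be verified directly. The sketch below is an illustration under stated assumptions: the two small distributions `g` and `f` are invented for the example, and the function names are hypothetical. It forms the compound distribution from (5.2) by repeated convolution and compares its generating function with $g(f(s))$.

```python
def convolve(a, b):
    """Full convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def compound_distribution(g, f):
    """Formula (5.2): sum over n of g_n times the n-fold convolution of f."""
    size = (len(g) - 1) * (len(f) - 1) + 1
    result = [0.0] * size
    power = [1.0]  # 0-fold convolution: unit mass at 0
    for gn in g:
        for j, x in enumerate(power):
            result[j] += gn * x
        power = convolve(power, f)
    return result


def gf(dist, s):
    """Generating function of a finite distribution evaluated at s."""
    return sum(c * s**k for k, c in enumerate(dist))


if __name__ == "__main__":
    g = [0.2, 0.5, 0.3]   # assumed distribution of N
    f = [0.1, 0.6, 0.3]   # assumed common distribution of the X_k
    h = compound_distribution(g, f)
    for s in (0.0, 0.3, 0.9):
        print(s, gf(h, s), gf(g, gf(f, s)))  # the two columns should agree
```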

Examples. (d) In the special case (5.4) the generating function
of the $X_j$ was $q + ps$, and $g(s) = e^{-\lambda + \lambda s}$. Hence
$g(f(s)) = e^{-\lambda + \lambda q + \lambda p s} = e^{-\lambda p + \lambda p s}$, in agreement with (5.4).³

(e) The generating function of (5.5) is $g(q + ps)$. Suppose, in
particular, that $\{g_n\}$ is the geometric distribution $\{(1 - \gamma)\gamma^n\}$. Then
$g(s) = (1 - \gamma)/(1 - \gamma s)$ and $g(q + ps) = (1 - \alpha)/(1 - \alpha s)$ with
$\alpha = \gamma p/(1 - \gamma q)$. This implies that the combination of a geometric
and a binomial distribution leads to a new geometric distribution
(see, for an example, problem 9 of Chapter 5).
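
The algebra behind example (e) is short enough to write out. The following worked step is a sketch in LaTeX, using only $p + q = 1$ and the symbols $\gamma$ and $\alpha$ as in the example:

\[
g(q + ps) = \frac{1 - \gamma}{1 - \gamma q - \gamma p s}
          = \frac{(1 - \gamma)/(1 - \gamma q)}{1 - \dfrac{\gamma p}{1 - \gamma q}\, s},
\qquad
1 - \alpha = \frac{1 - \gamma q - \gamma p}{1 - \gamma q} = \frac{1 - \gamma}{1 - \gamma q},
\]

so that the numerator equals $1 - \alpha$ and the coefficient of $s$ in the denominator equals $\alpha$, which gives $g(q + ps) = (1 - \alpha)/(1 - \alpha s)$.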

As an application let us calculate the expectation of the distribution
(5.2). By the chain rule of differentiation this expectation is
$g'(f(1))\,f'(1)$. However, $f'(1) = E(X_j)$ and $g'(f(1)) = g'(1) = E(N)$.
Hence the expectation of the distribution (5.2) is the product $E(N)E(X_j)$.

*6. Chain Reactions

We shall now analyze a chance process which serves as a simplified 
model of many empirical processes and also provides an excellent 
illustration of one way in which generating functions are useful. In 
words the process may be described as follows. 

We consider particles which are able to produce new particles of like
kind. A single particle forms the original, or zero, generation. Any
particle has probability $p_k$ ($k = 0, 1, 2, \ldots$) of creating exactly $k$ new
particles; the direct descendants of the nth generation form the (n + 1)th
generation. The particles of each generation act independently of each
other.

Three typical illustrations may precede a rigorous formulation in 
terms of random variables. 

Examples. (a) Nuclear Chain Reactions. This application became
familiar in connection with the atomic bomb.⁴ The particles are
neutrons, which are subject to chance hits by other particles. On
fission, each particle gives birth to a fixed number, $m$, of direct
descendants. Let $p$ be the probability that a particle sooner or later scores a
hit; then $q = 1 - p$ is the probability that the particle remains inactive
(is removed or absorbed in a different way). In this case the only
possible numbers of descendants are 0 and $m$, and the corresponding
probabilities are $q$ and $p$ (i.e., $p_0 = q$, $p_m = p$, $p_j = 0$ for all other $j$).
At worst, the first particle remains inactive and the process never
starts. At best, there will be $m$ particles of the first generation, $m^2$ of
the second, and so on. If $p$ is near one, the number of particles is
likely to increase very rapidly. Mathematically, this number may
increase indefinitely. Physically speaking, for very large numbers of
particles the probabilities of fission cannot remain constant, and also
statistical independence no longer holds. However, for ordinary chain
reactions, the mathematical description “indefinitely increasing number
of particles” may be translated by “explosion.”

* Starred sections treat special topics and may be omitted at first reading.

³ For limit theorems concerning sums $X_1 + \cdots + X_N$ where $N$ is a random
variable cf. H. Robbins, The Asymptotic Distribution of the Sum of a Random
Number of Variables, Bulletin of the American Mathematical Society, vol. 54 (1948),
pp. 1151-1161. For other general classes and properties of compound distributions cf.
W. Feller, On a General Class of “Contagious” Distributions, Annals of
Mathematical Statistics, vol. 14 (1943), pp. 389-400.

⁴ The following description follows E. Schroedinger, Probability Problems in
Nuclear Chemistry, Proceedings of the Royal Irish Academy, vol. 51, sect. A, No. 1
(December, 1945). There the assumption of spatial homogeneity is removed.

(b) Survival of Family Names. Here (as often in life), only male
descendants count; they play the role of particles, and $p_k$ is the
probability of a newborn boy's producing exactly $k$ boys. Our scheme
introduces two artificial simplifications. Fertility is subject to secular
trends, and therefore the distribution $\{p_k\}$ in reality changes from
generation to generation. Moreover, common inheritance and common
environment are bound to produce similarities among brothers, a fact
which is contrary to our assumption of statistical independence. Our
model can be refined to take care of these objections, but the essential
features remain unaffected. We shall derive the probability of finding
$k$ carriers of the family name in the nth generation and, in particular,
the probability of an extinction of the line. Survival of family names
appears to have been the first chain reaction studied by probability
methods. The problem was first treated by F. Galton (1889); for a
detailed account the reader is referred to A. Lotka's book.⁵ Lotka
shows that American experience is reasonably well described by the
distribution $p_0 = 0.4825$, $p_k = (0.2126)(0.5893)^{k-1}$ ($k \ge 1$), which,
except for the first term, is a geometric distribution.

(c) Genes and Mutations. Every gene of a given organism (Chapter
5, section 5) has a chance to reappear in 1, 2, 3, ... direct descendants
of that organism, and our scheme describes the process, neglecting, of
course, variations within the population and with time. This scheme
is of particular use in the study of mutations,⁶ or changes of form in a
gene. A spontaneous mutation produces a single gene of the new kind,
which plays the role of a zero-generation particle. Our theory leads to
estimates of the chances of survival and of the spread of the mutant
gene. To fix ideas concerning the forms which the distribution $\{p_k\}$
may assume, consider (following R. A. Fisher) a corn plant which is
father to some 100 seeds and mother to an equal number. If the
population size remains constant, an average of 2 among these 200
seeds will develop to a plant. Each seed has probability 1/2 to receive
a particular gene. The probability of a particular mutant gene being
represented in exactly $k$ new plants is therefore comparable to the
probability of exactly $k$ successes in 200 Bernoulli trials with
probability $p = 1/200$. Here the Poisson approximation applies, and it
appears reasonable to assume that $\{p_k\}$ is, approximately, a Poisson
distribution with mean 1. If the gene carries a biological advantage,
we are led to assume a Poisson distribution with mean $\lambda > 1$.

⁵ Théorie analytique des associations biologiques, Vol. 2, Actualités scientifiques
et industrielles, No. 780 (1939), pp. 123-136, Hermann et Cie, Paris.

⁶ R. A. Fisher, The Genetical Theory of Natural Selection, Oxford, 1930, pp. 73ff.

For a mathematical description of the chain reaction we introduce
the random variable $X_n$ representing the number of particles in the
nth generation. Then $X_0 = 1$, while $X_1$ has the given probability
distribution $\{p_k\}$. The number of descendants of each of the $X_1$
particles of the first generation is a random variable with the same
distribution. Therefore $X_2 = U_1 + U_2 + \cdots + U_{X_1}$ where each $U_j$ has
the distribution $\{p_k\}$. By assumption these $U_j$ are mutually
independent. Now

(6.1) $P(s) = \sum_{k=0}^{\infty} p_k s^k$

is the common generating function of the $U_j$ and $X_1$. Therefore the
generating function of $X_2$ is $P_2(s) = P(P(s))$ by the theorem of the
preceding section. In like manner, the $X_3$ particles of the third generation
are second-order descendants of the $X_1$ particles of the first generation,
so that $X_3$ is the sum of $X_1$ mutually independent variables each having
the same distribution as $X_2$. This means that the generating function
of $X_3$ is $P_3(s) = P(P_2(s))$. The same argument shows that in general
the generating function $P_{n+1}(s)$ of the number $X_{n+1}$ of particles in the
(n + 1)st generation is defined recursively by

(6.2) $P_1(s) = P(s), \qquad P_{n+1}(s) = P(P_n(s))$.

In example (a) $P(s) = q + ps^m$, and hence $P_2(s) = q + p(q + ps^m)^m$,
$P_3(s) = q + p\{q + p(q + ps^m)^m\}^m$, etc. For a Poisson distribution
$P(s) = e^{-\lambda(1-s)}$, $P_2(s) = e^{-\lambda + \lambda e^{-\lambda + \lambda s}}$, etc. These formulas are not
very pleasing, but (6.2) permits us to draw important general
conclusions.
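
When each particle can have only finitely many descendants, the recursion (6.2) can be carried out on polynomial coefficients. The sketch below is an illustration with hypothetical function names; the values $m = 2$, $p = 0.7$ for example (a) are assumed for the demonstration. It computes the distribution of $X_n$ by composing the offspring polynomial with itself.

```python
def poly_mul(a, b):
    """Product of two polynomials given by ascending coefficient lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def compose(outer, inner):
    """Coefficients of outer(inner(s)), evaluated by Horner's rule."""
    result = [outer[-1]]
    for c in reversed(outer[:-1]):
        result = poly_mul(result, inner)
        result[0] += c
    return result


def generation_distribution(offspring, n):
    """Distribution of X_n via (6.2): P_1 = P, P_{n+1}(s) = P(P_n(s))."""
    dist = list(offspring)
    for _ in range(n - 1):
        dist = compose(offspring, dist)
    return dist


if __name__ == "__main__":
    offspring = [0.3, 0.0, 0.7]   # example (a) with assumed m = 2, p = 0.7, q = 0.3
    for n in (1, 2, 3):
        dist = generation_distribution(offspring, n)
        mean = sum(k * c for k, c in enumerate(dist))
        print(n, dist[0], mean)   # dist[0] is Pr{X_n = 0}
```

With these values $P_2(s) = 0.3 + 0.7(0.3 + 0.7s^2)^2 = 0.363 + 0.294 s^2 + 0.343 s^4$, which the second generation printed by the sketch should match.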

We first inquire into the probability $x_n$ that the process terminates
at or before the nth generation. This means that $X_n = 0$, and hence
$x_n = P_n(0)$. It is clear from the definition that $x_n$ can only increase.
As a matter of fact, we have $x_1 = P(0) = p_0$, and $x_{n+1} = P(x_n)$.
Now $P(s)$ is an increasing function, and therefore $x_2 = P(x_1) > P(0)
= x_1$; by induction $x_{n+1} = P(x_n) > P(x_{n-1}) = x_n$. It follows that
$x_n \to \zeta$ where $\zeta$ satisfies

(6.3) $\zeta = P(\zeta)$.

We claim that $\zeta$ is the smallest root of (6.3). In fact, if $\eta$ is any other
root, then, since $P(s)$ is increasing, $x_1 = P(0) < P(\eta) = \eta$; by
induction if $x_n < \eta$, then also $x_{n+1} = P(x_n) < P(\eta) = \eta$ so that $\zeta \le \eta$,
which proves that no smaller root than $\zeta$ can exist.

We have now to investigate the roots of the equation $s = P(s)$.
Clearly $s = 1$ is always a root. If there exist two roots $s_1$ and $s_2$,
then the difference ratio $\{P(s_2) - P(s_1)\}/(s_2 - s_1)$ equals one, and by
the mean value theorem there exists a point $\sigma$ between $s_1$ and $s_2$ for
which $P'(\sigma) = 1$. However, for $0 < s < 1$ the function $P'(s)$ increases
steadily, and hence there exists at most one value $\sigma$ between 0 and 1
for which $P'(\sigma) = 1$. In other words, there can exist at most one pair
of roots of $s = P(s)$ lying within the interval $0 \le s \le 1$. The end
point $s = 1$ is always a root. For the existence of a second root $s$ with
$0 < s < 1$ it is necessary that $P'(\sigma) = 1$ for some value $\sigma < 1$ and
hence $P'(1) > 1$. On the other hand, if no root $s < 1$ exists, then
$p_0 = P(0) > 0$ and hence $P(s) > s$ for $0 \le s < 1$. Since $P(1) = 1$,
it follows that $P'(1) \le 1$. Thus a root $s < 1$ of $s = P(s)$ exists if and
only if $P'(1) > 1$; this root is unique. Now $P'(1) = \sum k p_k$ is the
expected number of direct descendants of a particle. We can,
therefore, formulate the fundamental result:

Let $\mu = \sum k p_k$ be the mean of the number of direct descendants of a
single particle. If $\mu \le 1$, then the probability tends to one that the process
will terminate before the nth generation (that is, $X_n = 0$). If $\mu > 1$,
then there exists a unique root $\zeta < 1$ of (6.3), and $\zeta$ is the limit of the
probability that the process terminates after finitely many generations.
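
The recursion $x_{n+1} = P(x_n)$ starting from $x_0 = 0$ gives a direct way to compute the extinction probability numerically. The sketch below is illustrative: the function names, the tolerance, and the Poisson means 2 and 0.8 are assumptions, while the second generating function uses the distribution attributed to Lotka in example (b), for which $P(s) = 0.4825 + 0.2126\,s/(1 - 0.5893\,s)$.

```python
from math import exp


def extinction_probability(P, tol=1e-13, max_iter=1_000_000):
    """Iterate x_{n+1} = P(x_n) from x_0 = 0; the limit is the smallest root of s = P(s)."""
    x = 0.0
    for _ in range(max_iter):
        nxt = P(x)
        if abs(nxt - x) < tol:
            return nxt
        x = nxt
    return x


def poisson_offspring(lam):
    """Generating function of a Poisson number of descendants with mean lam."""
    return lambda s: exp(-lam * (1.0 - s))


def lotka_offspring(s):
    """Generating function for p_0 = 0.4825, p_k = 0.2126 * 0.5893^(k-1), k >= 1."""
    return 0.4825 + 0.2126 * s / (1.0 - 0.5893 * s)


if __name__ == "__main__":
    print(extinction_probability(poisson_offspring(2.0)))   # mean > 1: a root below 1
    print(extinction_probability(poisson_offspring(0.8)))   # mean < 1: tends to 1
    print(extinction_probability(lotka_offspring))          # roughly 0.82
```

The iteration converges slowly when the mean is close to 1, so the iteration cap should be treated as a practical safeguard rather than a guarantee of accuracy in that case.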

The difference $1 - \zeta$ can be called the probability of an infinitely
prolonged process. Usually $x_n$ converges to $\zeta$ rapidly, so that if the
process terminates it is likely to proceed for only very few generations.
In practice, therefore, $\zeta$ is the probability of a rapid extinction. In
the example (c) we may call $1 - \zeta$ the probability that a mutant gene
establishes itself. If we start with $r$ particles instead of a single one,
the probability that all $r$ descendant lines die out is $\zeta^r$, and the
probability of at least one being successful is $1 - \zeta^r$. Even if $\zeta$ is relatively
large, $1 - \zeta^r$ is near 1 if the initial number $r$ is large. In the nuclear
chain reaction of example (a) this is always the case, and hence we
can say: if $\mu > 1$, the probability of an explosion is near 1, while for
$\mu \le 1$ the probability is 1 that the process stops after a finite number
of generations.

We can also find the expected size of the nth generation
$E(X_n) = P_n'(1)$. Since $P_n(s) = P(P_{n-1}(s))$, we find

$P_n'(1) = P'(P_{n-1}(1))\, P_{n-1}'(1) = P'(1)\, P_{n-1}'(1) = \mu\, P_{n-1}'(1)$,

and generally by induction

(6.4) $E(X_n) = \mu^n$.

Hence, if $\mu > 1$, we should expect an exponential growth. This
argument can be amplified. It is easily seen that not only $P_n(0) \to \zeta$ but
also $P_n(s) \to \zeta$ for all $s < 1$. This means that the coefficients of
$s, s^2, s^3, \ldots$ tend to zero. After a large number of generations the
probability that no descendants exist is near $\zeta$, and the probability that
the number of descendants exceeds any preassigned bound is near $1 - \zeta$;
it is exceedingly improbable to find a moderate number of descendants.

We have found only the distribution of the individual variables $X_n$. For a
general theory we require also the joint distributions of all combinations
$(X_1, X_2, \ldots, X_n)$, and these can be easily derived (cf. problem 21). The
sequence $\{X_n\}$ of (mutually dependent) random variables describes our
stochastic process. The whole infinite sequence and probability relations in it
must be considered if we desire properties of other random variables connected
with the process, such as the lifetime (number of generations before extinction)
in the case $\mu \le 1$, etc. Further results concerning the behavior of $X_n$ are
found in a paper by Harris, which, however, uses deep complex variable methods.⁷

7. Partial Fraction Expansions 

It is frequently practically impossible to find explicit expressions 
for a required probability distribution but simple to find the 
corresponding generating function P(s). The chain reactions treated 
in the preceding section illustrate the fact that much useful information 
can be obtained directly from the generating function. Moreover, if 
the generating function is known, it is usually possible to derive from 
it simple approximations to the required probabilities. The exact
expressions for $\{p_k\}$ are in many cases so complicated that approximations
are preferable. Perhaps the most useful method is that of partial
fractions which we proceed to describe in the simplest case.

⁷ T. E. Harris, Branching Processes, Annals of Mathematical Statistics, vol. 19
(1948), pp. 474-494.

Suppose that the generating function is rational, that is,

(7.1) $P(s) = \frac{U(s)}{V(s)}$

where $U(s)$ and $V(s)$ are polynomials without common roots. For
simplicity let us first assume that the degree of $U(s)$ is lower than the
degree of $V(s)$. Assume that $V(s)$ is of degree $m$ and that its $m$ (real
or imaginary) roots $s_1, \ldots, s_m$ are all distinct. Then

(7.2) $V(s) = (s - s_1)(s - s_2) \cdots (s - s_m)$,

and it is known from algebra that $P(s)$ can be decomposed into partial
fractions

(7.3) $P(s) = \frac{\rho_1}{s_1 - s} + \frac{\rho_2}{s_2 - s} + \cdots + \frac{\rho_m}{s_m - s}$

where $\rho_1, \rho_2, \ldots, \rho_m$ are constants. The roots $s_1, s_2, \ldots, s_m$ are found
by solving the equation $V(s) = 0$. To find the constant $\rho_1$ we multiply
(7.3) by $s_1 - s$; we see that as $s \to s_1$ the product $(s_1 - s)P(s)$ tends
to $\rho_1$. On the other hand, from (7.1) and (7.2) we get

(7.4) $(s_1 - s)P(s) = \frac{-U(s)}{(s - s_2)(s - s_3) \cdots (s - s_m)}$.

As $s \to s_1$ the numerator tends to $-U(s_1)$ and the denominator to
$(s_1 - s_2)(s_1 - s_3) \cdots (s_1 - s_m)$, which is the same as $V'(s_1)$. Thus
$\rho_1 = -U(s_1)/V'(s_1)$. The same argument applies to all roots, so that
for $k \le m$

(7.5) $\rho_k = \frac{-U(s_k)}{V'(s_k)}$.

Unfortunately, extensive numerical calculation is usually required
to put (7.1) into the form (7.3). However, once the expansion (7.3)
is obtained, we can easily derive an exact expression for the coefficient
of $s^n$ in $P(s)$. Note that we can write

(7.6) $\frac{1}{s_k - s} = \frac{1}{s_k} \cdot \frac{1}{1 - s/s_k}$.

For $|s| < |s_k|$ we can expand the last fraction into a geometric series

(7.7) $\frac{1}{1 - s/s_k} = 1 + \frac{s}{s_k} + \left(\frac{s}{s_k}\right)^2 + \left(\frac{s}{s_k}\right)^3 + \cdots$.

Introducing these expressions into (7.3), we find for the coefficient $p_n$
of $s^n$

(7.8) $p_n = \frac{\rho_1}{s_1^{n+1}} + \frac{\rho_2}{s_2^{n+1}} + \cdots + \frac{\rho_m}{s_m^{n+1}}$.

Thus, to get $p_n$ we have first to find the roots $s_1, \ldots, s_m$ of the
denominator and then to determine the coefficients $\rho_1, \ldots, \rho_m$ from
(7.5).

In (7.8) we have an exact expression for the probability $p_n$. The
labor involved in calculating all $m$ roots is usually prohibitive, and
therefore formula (7.8) is primarily of theoretical interest. Fortunately
a single term in (7.8) almost always provides a satisfactory
approximation. In fact, suppose that $s_1$ is a root which is smaller in absolute
value than all other roots. Then the first denominator in (7.8) is smallest.
Clearly, as $n$ increases, the proportionate contributions of the other
terms decrease and the first term preponderates. In other words, if $s_1$
is a root of $V(s) = 0$ which is smaller in absolute value than all other
roots, then, as $n \to \infty$,

(7.9) $p_n \sim \frac{\rho_1}{s_1^{n+1}}$

(where the sign $\sim$ indicates that the ratio of the two sides tends to 1).
Usually this formula provides surprisingly good approximations even
for relatively small values of $n$. The main advantage of (7.9) lies in
the fact that it requires the computation of only one root of an algebraic
equation.

Examples. (a) Let $q_n$ be the probability that in $n$ tosses of an ideal
coin no run of three consecutive heads appears [note that $\{q_n\}$ is not a
probability distribution; if $p_n$ is the probability that the first run of
three consecutive heads ends at the nth trial, then $\{p_n\}$ is a probability
distribution, and $q_n$ represents its “tails,” $q_n = p_{n+1} + p_{n+2} + \cdots$].

We can easily show that $q_n$ satisfies the recurrence formula

(7.10) $q_n = \tfrac{1}{2} q_{n-1} + \tfrac{1}{4} q_{n-2} + \tfrac{1}{8} q_{n-3}$.

In fact, the event that $n$ trials produce no sequence HHH can occur
only when the trials begin with T, HT, or HHT. The probabilities
that the following trials lead to no run HHH are $q_{n-1}$, $q_{n-2}$, and $q_{n-3}$,
respectively, and the right side of (7.10) therefore contains the
probabilities of the three mutually exclusive ways in which the event “no
run HHH” can occur.

Evidently $q_0 = q_1 = q_2 = 1$, and hence the $q_n$ can be calculated
successively from (7.10). To obtain the generating function
$Q(s) = \sum q_n s^n$ we multiply both sides by $s^n$ and add for $n = 3, 4, 5, \ldots$. We
get

(7.11) $Q(s) - 1 - s - s^2 = \tfrac{1}{2} s \{Q(s) - 1 - s\} + \tfrac{1}{4} s^2 \{Q(s) - 1\} + \tfrac{1}{8} s^3 Q(s)$

or

(7.12) $Q(s) = \frac{2s^2 + 4s + 8}{8 - 4s - 2s^2 - s^3}$.

The denominator has the real root $s_1 = 1.0873778\ldots$ and two complex
roots. For $|s| < s_1$ we have

$|4s + 2s^2 + s^3| < 4s_1 + 2s_1^2 + s_1^3 = 8$,

and the same inequality holds also when $|s| = s_1$ unless $s = s_1$. Hence
the other two roots exceed $s_1$ in absolute value. Thus, from (7.9)

(7.13) $q_n \sim \frac{1.236840}{(1.0873778)^{n+1}}$,

where the numerator equals $(2s_1^2 + 4s_1 + 8)/(4 + 4s_1 + 3s_1^2)$. Table
1 shows that (7.13) gives excellent approximations for all $n$ in question.
(The general theory of runs is developed in Chapter 13, section 1.)


TABLE 1

Illustration to Partial Fraction Expansions

               q_n                       r_n
  n     True      Approximate     True      Approximate
  3    0.875       0.88469       0.875       0.85607
  4    0.8125      0.81360       0.750       0.75115
  5    0.75        0.74822       0.65625     0.65909
  6    0.6875      0.68810       0.57812     0.57831
  7    0.63281     0.63280       0.50781     0.50743
  8    0.58203     0.58195       0.44531     0.44524
  9    0.53516     0.53519       0.39063     0.39067
 10    0.49219     0.49218       0.34277     0.34279
 11    0.45264     0.45263       0.30078     0.30078
 12    0.41626     0.41626       0.26391     0.26391

$q_n$ is the probability that in $n$ tosses of a coin the sequence HHH does not appear,
and $r_n$ the corresponding probability for the sequence HTH. The true values are
calculated from (7.10) and (7.14), the approximations from (7.13) and (7.16).


(b) Let now $r_n$ be the probability that in $n$ tossings of a coin the
sequence HTH does not appear. This example is similar to the
preceding one, but note that $r_n$ is not the same as the probability $q_n$
that no run HHH appears. In this case it is no longer easy to obtain
a recursion formula analogous to (7.10): the general theory of Chapter
13 will show that

(7.14) $r_n = r_{n-1} - \tfrac{1}{4} r_{n-2} + \tfrac{1}{8} r_{n-3}, \qquad r_0 = r_1 = r_2 = 1$.

(The reader may try to verify this formula: cf. problem 2 of Chapter
13.) Proceeding as before, we find for the generating function

(7.15) $R(s) = \frac{8 + 2s^2}{8 - 8s + 2s^2 - s^3}$

and hence

(7.16) $r_n \sim 1.444248\,(1.139680)^{-n-1}$,

approximately. (Cf. Table 1.)

(c) Another numerical example will be found in Chapter 13, section 5.
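
The entries of Table 1 can be recomputed with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, the root is located by bisection (any root finder would do), and the bracket [1, 1.2] is an assumption justified by the values of $s_1$ quoted in (7.13) and (7.16). The true values come from the recurrences (7.10) and (7.14); the approximations use $\rho_1 = -U(s_1)/V'(s_1)$ from (7.5) in the one-term formula (7.9).

```python
def recurrence_values(c1, c2, c3, n_max):
    """x_n = c1*x_{n-1} + c2*x_{n-2} + c3*x_{n-3} with x_0 = x_1 = x_2 = 1."""
    x = [1.0, 1.0, 1.0]
    for n in range(3, n_max + 1):
        x.append(c1 * x[n - 1] + c2 * x[n - 2] + c3 * x[n - 3])
    return x


def horner(coeffs, s):
    """Evaluate a polynomial given by ascending coefficients."""
    value = 0.0
    for c in reversed(coeffs):
        value = value * s + c
    return value


def derivative(coeffs):
    """Ascending coefficients of the derivative polynomial."""
    return [k * c for k, c in enumerate(coeffs)][1:]


def bisect_root(coeffs, lo, hi, iterations=200):
    """Root of the polynomial in [lo, hi], assuming a sign change on the interval."""
    f_lo = horner(coeffs, lo)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        f_mid = horner(coeffs, mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def dominant_term(U, V, lo, hi):
    """Return (rho_1, s_1) with rho_1 = -U(s_1)/V'(s_1), cf. (7.5)."""
    s1 = bisect_root(V, lo, hi)
    rho1 = -horner(U, s1) / horner(derivative(V), s1)
    return rho1, s1


if __name__ == "__main__":
    q_true = recurrence_values(0.5, 0.25, 0.125, 12)        # (7.10), HHH
    r_true = recurrence_values(1.0, -0.25, 0.125, 12)       # (7.14), HTH
    rho_q, s_q = dominant_term([8, 4, 2], [8, -4, -2, -1], 1.0, 1.2)   # from (7.12)
    rho_r, s_r = dominant_term([8, 0, 2], [8, -8, 2, -1], 1.0, 1.2)    # from (7.15)
    for n in range(3, 13):
        print(n, q_true[n], rho_q / s_q ** (n + 1),
              r_true[n], rho_r / s_r ** (n + 1))
```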

It is easy to remove the restrictions under which we have derived
the asymptotic formula (7.9). To begin with, the degree of the
numerator in (7.1) may exceed the degree $m$ of the denominator. Let $U(s)$
be of degree $m + r$ ($r \ge 0$); a division reduces $P(s)$ to a polynomial of
degree $r$ plus a fraction $U_1(s)/V(s)$ in which $U_1(s)$ is a polynomial of a
degree lower than $m$. The polynomial affects only the first $r + 1$ terms
of the distribution $\{p_n\}$, and $U_1(s)/V(s)$ can be expanded into partial
fractions as explained above. Thus (7.9) remains true. Secondly, the
restriction that $V(s)$ should have only simple roots is unnecessary. It
is known from algebra that every rational function admits of an
expansion into partial fractions. If $s_k$ is a double root of $V(s)$, then the partial
fraction expansion (7.3) will contain an additional term of the
form $a/(s - s_k)^2$, and this will contribute a term of the form
$a(n + 1)s_k^{-(n+2)}$ to the exact expression (7.8) for $p_n$. However, this
does not affect the asymptotic expansion (7.9), provided only that $s_1$
is a simple root. We note this result for future reference as a

Theorem. If $P(s)$ is a rational function with a root $s_1$ of the
denominator which is smallest in absolute value and is a simple root, then the
coefficient $p_n$ of $s^n$ is given asymptotically by $p_n \sim \rho_1 s_1^{-(n+1)}$, where $\rho_1$ is
defined in (7.5).

A similar asymptotic expansion exists also in the case where $s_1$ is
a multiple root. Finally, the restriction to rational functions was
introduced only for convenience. It is known from the theory of
complex variables that a much more general class of functions admits
of partial fraction expansions, and this is one of the sources of usefulness
of generating functions and characteristic functions in general. (Cf.
also problem 22.)

*8. The Continuity Theorem

We know from Chapter 6 that the Poisson distribution $\{e^{-\lambda}\lambda^k/k!\}$
is the limiting form of the binomial distribution with the probability p
depending on n in such a way that $np \to \lambda$ as $n \to \infty$. Then
$b(k; n, p) \to e^{-\lambda}\lambda^k/k!$. The generating function of $\{b(k; n, p)\}$ is
$(q + ps)^n = \{1 - \lambda(1 - s)/n\}^n$. Taking logarithms, we see directly
that this generating function tends to $e^{-\lambda(1-s)}$, which is the generating
function of the Poisson distribution. We now show that this situation
prevails in general; a sequence of probability distributions converges
to a limiting distribution if and only if the corresponding generating
functions converge. Unfortunately, this theorem is of limited
applicability, since the most interesting limiting forms of discrete distributions
are continuous distributions (for example, the normal distribution
appears as a limiting form of the binomial distribution).
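
The convergence of the binomial generating function to the Poisson one can be checked numerically. The following Python sketch is only an illustration under the assumption p = λ/n; the function names and the sample values λ = 2, s = 0.5 are chosen here and do not come from the text.

```python
import math

def binomial_pgf(s: float, n: int, lam: float) -> float:
    """Generating function (q + p*s)**n of b(k; n, p), assuming p = lam/n."""
    p = lam / n
    return (1.0 - p + p * s) ** n

def poisson_pgf(s: float, lam: float) -> float:
    """Generating function exp(-lam*(1 - s)) of the Poisson distribution."""
    return math.exp(-lam * (1.0 - s))

if __name__ == "__main__":
    lam, s = 2.0, 0.5  # illustrative values only
    for n in (10, 100, 1000, 10000):
        gap = abs(binomial_pgf(s, n, lam) - poisson_pgf(s, lam))
        print(f"n={n:6d}  |A_n(s) - A(s)| = {gap:.2e}")
```

The printed gap should shrink roughly like 1/n, in agreement with the passage to logarithms described above.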

Continuity Theorem. Suppose that for every fixed n the sequence
$a_{0,n}, a_{1,n}, a_{2,n}, \ldots$ is a probability distribution, that is,

(8.1)   $a_{k,n} \ge 0, \qquad \sum_{k=0}^{\infty} a_{k,n} = 1.$

In order that for every fixed k

(8.2)   $a_{k,n} \to a_k$

as $n \to \infty$ it is necessary and sufficient that for every s with 0 < s < 1

(8.3)   $A_n(s) \to A(s)$

where

(8.4)   $A_n(s) = \sum_{k=0}^{\infty} a_{k,n}s^k, \qquad A(s) = \sum_{k=0}^{\infty} a_k s^k$

are the corresponding generating functions.

Note. If (8.2) holds, then automatically $0 \le a_k \le 1$ and $\sum a_k \le 1$.
The generating function A(s) exists therefore at least for |s| < 1.
However, the limiting sequence $\{a_k\}$ is not necessarily a probability
distribution; for example, if the first n terms of the distribution $\{a_{k,n}\}$
vanish, then the limiting sequence vanishes identically. For $\{a_k\}$ to
be a probability distribution it is necessary and sufficient that $\sum a_k = 1$
or A(1) = 1.

* Starred sections treat special topics and may be omitted at first reading.

Proof.^8 First, suppose that (8.2) holds. For fixed s (0 < s < 1) and
fixed $\epsilon > 0$ we can choose r so that $s^r/(1 - s) < \epsilon$. Then

(8.5)   $|A_n(s) - A(s)| \le \sum_{k=0}^{r} |a_{k,n} - a_k| s^k + 2\epsilon.$

The sum on the right contains only finitely many terms each of which
tends to zero. Hence $|A_n(s) - A(s)|$ is arbitrarily small for n
sufficiently large. Next, assume that (8.3) holds. We use the well-known
fact^9 that it is always possible to find a subsequence $\{a_{k,n_\nu}\}$ of the given
sequence of distributions which converges. If (8.2) were not true,
then it would be possible to extract two subsequences converging to
two different limiting sequences $\{a_k^*\}$ and $\{a_k^{**}\}$, and the
corresponding subsequences of $\{A_n(s)\}$ would converge to $A^*(s) = \sum a_k^* s^k$ and
$A^{**}(s) = \sum a_k^{**} s^k$, respectively. However, this is impossible in view
of the assumption (8.3): both limits would coincide for 0 < s < 1, and a
power series is uniquely determined by its values on an interval.
Therefore (8.3) implies (8.2).

Examples. (a) The Pascal Distribution. We saw in section 3 that
the generating function of the Pascal distribution $\{f(k; r, p)\}$ is given
by $\{p/(1 - qs)\}^r$. Now let $\lambda$ be fixed, and let $p \to 1$, $q \to 0$, so that
$q = \lambda/r$ ($r \to \infty$). Then

(8.6)   $\left(\frac{p}{1 - qs}\right)^r = \left(\frac{1 - \lambda/r}{1 - \lambda s/r}\right)^r.$

Passing to logarithms, we see that the right side tends to $e^{-\lambda + \lambda s}$, which
is the generating function of the Poisson distribution $\{e^{-\lambda}\lambda^k/k!\}$.
Hence if $r \to \infty$ and $rq \to \lambda$, then

(8.7)   $f(k; r, p) \to e^{-\lambda}\frac{\lambda^k}{k!}.$

Note that this is a new limit theorem: in section 4 we found another
limit for the case where $p \to 0$ and $q \to 1$, but r remained constant.
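
The limit (8.7) can be illustrated numerically. The sketch below assumes the explicit form $f(k; r, p) = \binom{r+k-1}{k}p^r q^k$, which is the expansion of the generating function $\{p/(1 - qs)\}^r$ quoted above; the helper names and the choice λ = 2 are illustrative only.

```python
import math

def pascal_pmf(k: int, r: int, p: float) -> float:
    """f(k; r, p) = C(r+k-1, k) p^r q^k (assumed explicit form of the Pascal distribution)."""
    q = 1.0 - p
    return math.comb(r + k - 1, k) * p**r * q**k

def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability exp(-lam) lam^k / k!."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def max_gap(r: int, lam: float, kmax: int = 20) -> float:
    """Largest difference between the two distributions for k = 0..kmax, with q = lam/r."""
    p = 1.0 - lam / r
    return max(abs(pascal_pmf(k, r, p) - poisson_pmf(k, lam)) for k in range(kmax + 1))

if __name__ == "__main__":
    for r in (10, 100, 1000, 10000):
        print(r, max_gap(r, lam=2.0))
```

As r grows with rq = λ held fixed, the largest gap should decrease toward zero.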

8 The theorem is a special case of the continuity theorem for Laplace-Stieltjes
transforms, and the proof follows the general pattern. In the literature the
continuity theorem for generating functions is usually stated and proved under
unnecessary restrictions.

9 This is easily established by the “method of diagonals” due to G. Cantor and
found in all books on set theory. The statement is, incidentally, a special case of
a well-known theorem of Helly.

(b) Poisson Trials. This name is customary for the following
generalization of the scheme of Bernoulli trials. Each of n mutually
independent trials results in success (S) or failure (F); the corresponding
probabilities in the kth trial are $p_k$ and $q_k$ ($p_k + q_k = 1$). The number
$S_n$ of successes in n trials is a random variable which can be written
as the sum

(8.8)   $S_n = X_1 + \cdots + X_n$

of n mutually independent random variables $X_k$ with the distributions

(8.9)   $\Pr\{X_k = 0\} = q_k, \qquad \Pr\{X_k = 1\} = p_k.$

The generating function of $X_k$ is $q_k + p_k s$, and hence the generating
function of $S_n$ is

(8.10)  $P(s) = (q_1 + p_1 s)(q_2 + p_2 s) \cdots (q_n + p_n s).$

Clearly

(8.11)  $E(S_n) = p_1 + p_2 + \cdots + p_n,$
        $\mathrm{Var}(S_n) = p_1 q_1 + p_2 q_2 + \cdots + p_n q_n.$

As an application of this scheme let us assume that each house in a
city has a small probability $p_k$ of burning on a given day. The sum
$p_1 + \cdots + p_n$ is the expected number of fires in the city, n being the
number of houses. We have seen in Chapter 6 that if all $p_k$ are equal
and if the houses are statistically independent, then the number of
fires is a random variable whose distribution is near the Poisson
distribution. We show now that this conclusion remains valid also under
the more realistic assumption that the probabilities $p_k$ are not equal.
This result should increase our confidence in the Poisson distribution
as an adequate description of phenomena which are the cumulative
effect of many improbable events (“successes”). Examples are
accidents and telephone calls.

We use the now-familiar model of an increasing number n of variables
where the probabilities $p_k$ depend on n in such a way that the largest $p_k$
tends to zero, but the sum

(8.12)  $p_1 + p_2 + \cdots + p_n = \lambda$

remains constant. Then from (8.10)

(8.13)  $\log P(s) = \sum_{k=1}^{n} \log\{1 - p_k(1 - s)\}.$

Since $p_k \to 0$, we can use the fact that $\log(1 - x) = -x - \theta x$, where
$\theta \to 0$ as $x \to 0$. It follows that

(8.14)  $\log P(s) = -(1 - s)\left\{\sum_{k=1}^{n}(p_k + \theta_k p_k)\right\} \to -\lambda(1 - s),$

so that P(s) tends to the generating function of the Poisson
distribution. Hence, $S_n$ has in the limit a Poisson distribution. We conclude
that for large n and moderate values of $\lambda = p_1 + p_2 + \cdots + p_n$ the
distribution of $S_n$ can be approximated by a Poisson distribution.
Estimates of the error involved can be derived, but we shall not go
into such details. (Cf. also problem 20 of Chapter 9.)
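
The approximation can be tried out by multiplying the factors of (8.10) numerically. The following Python sketch is an illustration only: the unequal probabilities $p_k$ proportional to k are a hypothetical choice made here (so that the largest $p_k$ tends to zero while the sum stays equal to λ), and the function names are not from the text.

```python
import math

def poisson_trials_distribution(probs):
    """Coefficients of P(s) = prod (q_k + p_k s), that is, Pr{S_n = j} for j = 0..n."""
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for j, a in enumerate(dist):
            nxt[j] += a * (1.0 - p)   # failure in this trial
            nxt[j + 1] += a * p       # success in this trial
        dist = nxt
    return dist

def poisson_pmf(k, lam):
    """Poisson probability exp(-lam) lam^k / k!."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def max_gap(n, lam=2.0):
    """Largest gap between Pr{S_n = k} and the Poisson probability, k = 0..min(n, 30)."""
    # Hypothetical unequal probabilities: p_k proportional to k, scaled to sum to lam.
    total = n * (n + 1) / 2
    probs = [lam * k / total for k in range(1, n + 1)]
    dist = poisson_trials_distribution(probs)
    return max(abs(dist[k] - poisson_pmf(k, lam)) for k in range(min(n, 30) + 1))

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, max_gap(n))
```

With λ fixed, the gap should become small as n increases, which is the content of the limit statement above.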

9. Problems for Solution 

It is understood that the random variables occurring in the following
problems assume only non-negative integral values.

1. Let X be a random variable with generating function P(s). Find the
generating functions of X + 1 and 2X.

2. Continuation. Find the generating functions of (a) $\Pr\{X \le n\}$, (b) $\Pr\{X < n\}$,
(c) $\Pr\{X \ge n\}$, (d) $\Pr\{X > n + 1\}$, (e) $\Pr\{X = 2n\}$.

3. In a sequence of Bernoulli trials let $u_n$ be the probability that the first
combination SF occurs at trials number n - 1 and n. Find the generating function,
mean, and variance.

4. In a sequence of n Bernoulli trials let $u_n$ be the probability of an even
number of successes. Prove the recursion formula $u_n = qu_{n-1} + p(1 - u_{n-1})$. From
it derive the generating function and hence an explicit formula for $u_n$. [Note
that this formula is considerably simpler than the obvious formula
$u_n = b(0; n, p) + b(2; n, p) + \cdots$.]

5. In the sampling problem (3.d) of Chapter 9, find the generating function of
the variable $S_r$ (r fixed). Verify formulas (3.5) and (5.9) of Chapter 9 for the mean
and variance.

6. Continuation. The following is an alternative method for deriving the same
result. Let $p_n(r) = \Pr\{S_r = n\}$. Prove the recursion formula

(9.1)   $p_{n+1}(r) = \frac{r - 1}{N}p_n(r) + \frac{N - r + 1}{N}p_n(r - 1).$

Derive the generating function directly from (9.1).

7. Solve the two preceding problems for r preassigned elements (instead of r
arbitrary ones).

8.^10 Let the sequence of Bernoulli trials up to the first failure be called a turn.
Find the generating function and the probability distribution of the accumulated
number $S_r$ of successes in r turns.

10 Problems 8-10 have a direct bearing on the game of billiards. The probability
p of success is a measure of the player's skill. The player continues to play until
he fails. Hence the number of successes he accumulates is the length of his “turn.”
The game continues until one player has scored N successes. Problem 8 therefore
gives the probability distribution of the number of turns one player needs to score
k successes, problem 9 the average duration, and problem 10 the probability of a
tie between two players. For further details cf. O. Bottema and S. C. van Veen,
Kansberekeningen bij het biljartspel, Nieuw Archief voor Wiskunde (in Dutch),
vol. 22 (1943), pp. 16-33 and 123-158.

9.^10 Continuation. Let R be the number of successive turns up to the $\nu$th
success (that is, the $\nu$th success occurs during the Rth turn). Prove that
$\Pr\{R = r\} = p^{\nu}q^{r-1}\binom{r + \nu - 2}{\nu - 1}$. Find E(R) and Var(R).

10.^10 Continuation. Consider two sequences of Bernoulli trials with probabilities
$p_1, q_1$ and $p_2, q_2$, respectively. Show that the probability that the same number
of turns will lead to the Nth success can be exhibited in either of the forms:

$(p_1 p_2)^N \sum_{\nu=1}^{\infty}\binom{N + \nu - 2}{\nu - 1}^2 (q_1 q_2)^{\nu - 1} = (p_1 p_2)^N (1 - q_1 q_2)^{1 - 2N}\sum_{k=0}^{N-1}\binom{N - 1}{k}^2 (q_1 q_2)^k.$


11. Let $\{X_k\}$ be mutually independent variables, each assuming the values
0, 1, 2, ..., a - 1 with probabilities 1/a. Let $S_n = X_1 + \cdots + X_n$. Show that
the generating function of $S_n$ is

$P(s) = \left\{\frac{1 - s^a}{a(1 - s)}\right\}^n$

and hence

$\Pr\{S_n = j\} = \frac{1}{a^n}\sum_{\nu=0}^{\infty}(-1)^{\nu + j + a\nu}\binom{n}{\nu}\binom{-n}{j - a\nu}.$

(Only finitely many terms in the sum are different from zero.)

Note: For a = 6 we get the probability of scoring the sum j + n in a throw with
n dice. The solution goes back to DeMoivre.

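12. Continuation. The probability $\Pr\{S_n \le j\}$ has the generating function
P(s)/(1 - s) and hence

$\Pr\{S_n \le j\} = \frac{1}{a^n}\sum_{\nu}(-1)^{\nu + j + a\nu}\binom{n}{\nu}\binom{-n - 1}{j - a\nu}.$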




13. Continuation: the limiting form. If $a \to \infty$ and $j \to \infty$, so that $j/a \to x$,
then

$\Pr\{S_n \le j\} \to \frac{1}{n!}\sum_{\nu}(-1)^{\nu}\binom{n}{\nu}(x - \nu)^n,$

the summation extending over all $\nu$ with $0 \le \nu < x$.

Note: This result is due to Laplace. In the theory of geometric probabilities the
right-hand side represents the distribution function of the sum of n independent
random variables with “uniform” distribution in the interval (0, 1).

14. Let $u_n$ be the probability that the number of successes in n Bernoulli trials
is divisible by 3. Find a recurrence relation for $u_n$ and hence the generating
function.

15. Continuation: alternative method. Let $v_n$ and $w_n$ be the probabilities that $S_n$
is of the form $3\nu + 1$ and $3\nu + 2$, respectively (so that $u_n + v_n + w_n = 1$). Find
three simultaneous recurrence relations and hence three equations for the
generating functions.

16. Let X and Y be independent variables with generating functions U(s) and
V(s). Show that $\Pr\{X - Y = j\}$ is the coefficient of $s^j$ in U(s)V(1/s), where
$j = 0, \pm 1, \pm 2, \ldots$.

17. Moment generating functions. Let X be a random variable with generating
function P(s), and suppose that $\sum p_n s^n$ converges for some $s_0 > 1$. Then all
moments $m_r = E(X^r)$ exist, and the generating function F(s) of the sequence $m_r/r!$
converges at least for $|s| < \log s_0$. Moreover

$F(s) = \sum_{r=0}^{\infty}\frac{m_r}{r!}s^r = P(e^s).$

Note: F(s) is usually called the moment generating function, although in reality it
generates $m_r/r!$.

18. Prove the following formula for the “tails” of the Pascal distribution

£ /(*; r,p)='£ f(k; n - r,q) 

k**n fc— 0 

Note: No computations are necessary. 

19. Compound Poisson distribution.^11 Let the random variable X assume the
values $\lambda_1, \lambda_2, \ldots$ with probabilities $\eta_1, \eta_2, \ldots$ ($\lambda_j > 0$, $\sum \eta_j = 1$). Show that

$p_k = \frac{1}{k!}\sum_{j}\eta_j e^{-\lambda_j}\lambda_j^k$

is a probability distribution. Find the generating function and prove that its
mean equals E(X) and that its variance equals Var(X) + E(X).

20. In the chain reaction problem calculate $\mathrm{Var}(X_n)$.

21. Continuation. If n > m, show that $E(X_m X_n) = \mu^{n-m}E(X_m^2)$.

22. Suppose that $A(s) = \sum a_n s^n$ is a rational function U(s)/V(s) and that $s_1$
is a root of V(s), which is smaller in absolute value than all other roots. If $s_1$ is of
multiplicity r, show that

$a_n \sim \frac{\rho_1}{s_1^{n+r}}\binom{n + r - 1}{r - 1}$

where $\rho_1 = (-1)^r r!\,U(s_1)/V^{(r)}(s_1)$.

11 The word “compound” is used here in a slightly more general sense than in
section 5. Such distributions are of great importance. A typical interpretation is
as follows. The number of raisins in a cake may be assumed to have a Poisson
distribution (Chapter 6, end of section 5). Different kinds of cake are characterized
by different values of the mean. If these varieties are mixed in the proportions
$\eta_1 : \eta_2 : \eta_3 : \ldots$, then $p_k$ is the probability of finding exactly k raisins in a cake if
selected at random.



CHAPTER 12 


RECURRENT EVENTS: 
THEORY 


1. Definition

We consider a sequence of repeated trials with possible results
$E_j$ (j = 1, 2, ...). In all examples of this chapter the trials will be
independent, but the theorems will be formulated so as to apply also
to dependent trials and in particular to Markov chains (Chapter 15).
As usual, we suppose that it is in principle possible to continue the
trials indefinitely and that the probabilities $\Pr\{(E_{j_1}, \ldots, E_{j_n})\}$ of the
outcomes of the first n trials are defined consistently for all n.
We shall investigate classes of events defined by certain repetitive
patterns. Roughly speaking, a pattern ℰ qualifies for our theory if it
has the following two properties. For every possible outcome of n
trials $(E_{j_1}, E_{j_2}, \ldots, E_{j_n})$ it can be uniquely determined whether ℰ has
occurred and, if so, at which trial or trials. Moreover, every time ℰ
occurs the trials start from scratch in the sense that the trials following
an occurrence of ℰ are an exact replica of the entire sequence of trials.
Before giving a rigorous definition we shall illustrate the notion by a
few simple

Examples. (a) Return to Equilibrium in Coin Tossing. In a sequence
of independent tossings of a symmetric coin let ℰ stand as an
abbreviation for “the accumulated numbers of heads and tails are equal.”
For the time-honored bettor who at each trial loses or gains a unit
amount, the occurrence of ℰ means that his accumulated gain is zero.
Here it is clear that the occurrence of ℰ means a return to the initial
situation or, as we prefer to call it, a return to equilibrium. Clearly ℰ
can occur only at an even number of trials. If $u_n$ is the probability
that ℰ occurs at the nth trial, then $u_1 = u_3 = u_5 = u_7 = \cdots = 0$; for
ℰ to occur at the 2nth trial it is necessary and sufficient that the 2n
trials produce n heads and n tails, and hence

$u_{2n} = \binom{2n}{n}2^{-2n}.$

If the coin is tossed until ℰ occurs for the first time, we get the
following sequences (arranged according to length): HT, TH; HHTT,
TTHH; HHHTTT, HHTHTT, TTTHHH, TTHTHH; HHHHTTTT,
HHHTHTTT, HHHTTHTT, HHTHHTTT, HHTHTHTT, TTTTHHHH,
TTTHTHHH, TTTHHTHH, TTHTTHHH, TTHTHTHH;
etc. If $f_n$ is the probability that ℰ occurs for the first time at the nth
trial, then obviously $f_n = 0$ whenever n is odd. An inspection of the
above sequence shows that $f_2 = 1/2$, $f_4 = 1/8$, $f_6 = 1/16$, $f_8 = 5/128$.
A further enumeration leads to the values $f_{10} = 7/256$, $f_{12} = 21/1024$,
$f_{14} = 33/2048$, .... The general formula is not discernible, but will
be deduced later [cf. (3.19)].
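
The enumeration can be repeated mechanically. The following Python sketch is a brute-force illustration only (the function name is chosen here): it runs through all $2^n$ sequences of n tossings and counts those in which heads and tails are equal for the first time at the nth trial.

```python
from fractions import Fraction
from itertools import product

def first_return_probability(n: int) -> Fraction:
    """Probability that heads and tails are equal for the first time at trial n."""
    if n < 1 or n % 2:
        return Fraction(0)
    count = 0
    for tosses in product((1, -1), repeat=n):   # +1 = head, -1 = tail
        total = 0
        for i, step in enumerate(tosses, start=1):
            total += step
            if total == 0:
                break                            # first return to equilibrium
        # Count the sequence only if that first return is at the last toss.
        if total == 0 and i == n:
            count += 1
    return Fraction(count, 2**n)

if __name__ == "__main__":
    for n in range(2, 15, 2):
        print(n, first_return_probability(n))
```

The cost grows like $2^n$, so this is practical only for the small values of n listed in the example.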

(b) Success Runs. We consider Bernoulli trials with probability p
for success. Let ℰ stand for “three consecutive successes.” This
description is insufficient inasmuch as it does not tell whether in an
uninterrupted sequence of five successes ℰ occurs once, three times, or
not at all. For our purposes we must define ℰ so that the trials start
from scratch whenever ℰ occurs. This means that we must count
only non-overlapping runs of three successes each: in n trials ℰ occurs
as often as there are non-overlapping runs of exactly three successes.
In other words, ℰ occurs for the first time when for the first time three
successes appear in succession, and then the counting starts anew.
Thus, in the sequence SSSSSSSSF the recurrent event ℰ occurs at the
third and the sixth places. If $f_n$ is the probability that ℰ occurs for
the first time at the nth trial, then $f_1 = f_2 = 0$, $f_3 = p^3$, $f_4 = f_5 = f_6
= qp^3$, $f_7 = (1 - p^3)qp^3$, $f_8 = (1 - p^3 - qp^3)qp^3$, .... The general
theory of success runs and similar repetitive patterns will be taken up
in Chapter 13.

(c) Consider consecutive tossings of a symmetric die and let ℰ stand
for “ones, twos, ..., sixes appeared in equal numbers.” This is similar
to example (a), but, as we shall see, there is one important difference.
In example (a) we have $f_1 + f_2 + f_3 + \cdots = 1$, which may be
interpreted by saying that ℰ is certain to occur sooner or later. In the
present case this is not so. In fact, it will be found that the probability
$f_1 + f_2 + f_3 + \cdots$ that ℰ ever occurs is only about 0.022. [Cf.
example (3.d).]

Our examples show that it would be most natural to consider the
sample space of infinite sequences of trials [Chapter 8, section 1].
However, this is not necessary, and we may study recurrent events also
in spaces of finitely many trials. The essential property of recurrent
events is that after every occurrence of ℰ the trials start from scratch.
This means that all events preceding an occurrence of ℰ should be
statistically independent of all subsequent events. A more precise
description is contained in the

Definition. We speak of a recurrent event ℰ if the following conditions
are satisfied.

(1) There exists a rule which for every sequence $(E_{j_1}, E_{j_2}, \ldots, E_{j_n})$ of
possible outcomes uniquely determines whether or not ℰ occurs at the last
trial. This rule depends on the sequence and not on the following trials.

(2) Let $(E_{j_1}, E_{j_2}, \ldots, E_{j_m})$ be a sequence in which ℰ occurs at the last
trial and let $(E_{k_1}, \ldots, E_{k_n})$ be an arbitrary sequence. Then ℰ occurs at
the last [or (m + n)th] trial of the combined sequence
$(E_{j_1}, \ldots, E_{j_m}, E_{k_1}, \ldots, E_{k_n})$ if and only if it occurs at the last
[or nth] trial of $(E_{k_1}, \ldots, E_{k_n})$.

(3) In this case

$\Pr\{(E_{j_1}, \ldots, E_{j_m}, E_{k_1}, \ldots, E_{k_n})\}
    = \Pr\{(E_{j_1}, \ldots, E_{j_m})\}\Pr\{(E_{k_1}, \ldots, E_{k_n})\}.$

Given any particular sequence $(E_{j_1}, E_{j_2}, \ldots, E_{j_n})$ of possible
outcomes, we can consider the n subsequences $(E_{j_1})$, $(E_{j_1}, E_{j_2})$,
$(E_{j_1}, E_{j_2}, E_{j_3})$, ..., $(E_{j_1}, \ldots, E_{j_n})$ and mark those in which ℰ occurs at
the last trial. If this is true of k among these subsequences, then we
say that ℰ occurs in $(E_{j_1}, \ldots, E_{j_n})$ exactly k times. Thus, in example
(1.a), ℰ occurs in the sequence HHTHTTHTTHT three times, namely,
at trials number 6, 8, and 10.

Condition (3) implies the following property: if ℰ occurs at trial
number m, then the conditional probability that it occurs again at
the trial number m + n equals the probability that ℰ occurs at the nth
trial.

Example. (d) Consider a sequence of Bernoulli trials and suppose
that in the first trials successes alternate with failures. It is obviously
legitimate to say that in the sequence SFSFSFSFSFSF the pattern
SFS occurred at trials number 3, 5, 7, 9, and 11. However, if we wish
to define a recurrent event ℰ characterized by the pattern SFS, then
the counting is regulated by condition (2). We are required to
consider all subsequences, and since the third trial completes the first
pattern SFS, there is no doubt that ℰ occurs at the third trial. Now
condition (2) starts to operate. According to it, ℰ occurs at trial
number n + 3 if and only if it occurs at the nth trial of the reduced
sequence FSFSFSFSF (in which the first three trials are omitted).
Since here ℰ occurs for the first time at the fourth trial it occurs in the
original sequence for the second time only at the seventh trial. This
implies that the third occurrence of ℰ takes place at the eleventh trial,
etc.
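
The counting rule imposed by condition (2) amounts to restarting the search for the pattern after each occurrence. A minimal Python sketch of this rule follows; the function name and the representation of trials as a string of letters are assumptions made for the illustration.

```python
def recurrent_occurrences(trials: str, pattern: str) -> list[int]:
    """Trial numbers (1-based) at which the recurrent event defined by `pattern` occurs.

    After each occurrence the search starts from scratch, so completions of
    the pattern that overlap an earlier occurrence are not counted.
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    occurrences = []
    start = 0  # index of the first trial after the latest occurrence
    while True:
        pos = trials.find(pattern, start)
        if pos < 0:
            return occurrences
        end = pos + len(pattern)   # the pattern is completed at trial number `end`
        occurrences.append(end)
        start = end                # condition (2): restart after the occurrence

if __name__ == "__main__":
    print(recurrent_occurrences("SFSFSFSFSFSF", "SFS"))
    print(recurrent_occurrences("SSSSSSSSF", "SSS"))
```

Applied to the sequence of this example the function reports trials 3, 7, and 11; applied to SSSSSSSSF with the pattern SSS it reports trials 3 and 6, as in example (b).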

Consider now the sample space of n trials, that is, the aggregate of
all sequences $(E_{j_1}, E_{j_2}, \ldots, E_{j_n})$. In it we may consider an event such
as “ℰ occurs exactly three times”; it consists of those sample points
$(E_{j_1}, \ldots, E_{j_n})$ in which ℰ occurs exactly three times. Similarly, events
such as “ℰ occurs for the first time at the third trial,” “ℰ occurs an even
number of times,” etc., are well defined.

In particular, we may talk of the probability $f_k$ that ℰ occurs for
the first time at the kth trial ($k \le n$). Then $f_1 + f_2 + \cdots + f_k$ is
the probability that ℰ occurs in the first k trials. If $\sum f_k = 1$, this
probability tends to one, and we call ℰ certain; if $\sum f_k < 1$, ℰ will be
called uncertain. [In examples (1.a) and (1.b) ℰ is certain, in (1.c)
uncertain.] Finally we shall say that ℰ has period t > 1 if it can occur
only at trials number t, 2t, 3t, .... In example (1.a) ℰ can occur
only at even trials, hence t = 2; similarly, in example (1.c) t = 6. We
summarize these ideas in the following definition.

Classification. Let $f_k$ be the probability that ℰ occurs for the first time
at the kth trial. The recurrent event ℰ is called certain or uncertain
according as $\sum f_k = 1$ or $\sum f_k < 1$. Let t be the greatest integer such that
ℰ can occur only at trials number t, 2t, 3t, ... (then $f_k = 0$ whenever k is
not divisible by t). We say that ℰ has the period t if t > 1. If t = 1
then ℰ is called aperiodic.

2. Recurrence Times 

Suppose that ℰ is a certain recurrent event, that is, let

(2.1)   $\sum f_j = 1.$

In the preceding section we have considered the number of trials up
to and including the first occurrence of ℰ. This is obviously a random
variable $X_1$ which is completely defined only in the sample space of
infinite sequences of trials. However, it is clear that its probability
distribution must be given by

(2.2)   $\Pr\{X_1 = j\} = f_j.$

More generally, consider the random variable $X_k$ defined as the number
of trials following the (k - 1)st occurrence of ℰ up to and including the
kth occurrence. This is the kth recurrence time of ℰ. Again, this random
variable is defined completely only in the sample space of infinite
sequences of trials. We could effect a simple limiting process, but it
is clear from the very definition of recurrent events that all variables
$X_k$ have the common distribution (2.2) and that they are mutually
independent. We can avoid any further reference to infinite sample
space if we accept the following fact.

With every certain recurrent event ℰ there is associated an infinite
sequence of mutually independent random variables $X_k$, the recurrence
times. They have the common distribution $\{f_j\}$, and

(2.3)   $S_n = X_1 + \cdots + X_n$

is to be interpreted as the number of trials up to and including the nth
occurrence of ℰ.

We put

(2.4)   $\mu = \sum_{j=1}^{\infty} j f_j$

and call $\mu$ the mean recurrence time of ℰ. Clearly $E(X_k) = \mu$. Note that
the mean recurrence time may be infinite; in fact, in the most
interesting physical applications $\mu$ is infinite.

In the case of an uncertain ℰ put

(2.5)   $f = \sum_{j=1}^{\infty} f_j$

so that f < 1. Now $\{f_j\}$ is no longer a probability distribution.
However, all preceding remarks remain valid if we agree to describe a
non-occurrence of ℰ by saying that the recurrence time is infinite. This
means that the recurrence times $X_k$ now are generalized random variables
which (with probability 1 - f) assume the (improper) value $\infty$. Thus
$X_k = j$ with probability $f_j$ and $X_k = \infty$ with probability 1 - f; the
latter is an abbreviation for saying that there is probability 1 - f that
the (k - 1)st occurrence of ℰ (if any) is not followed by a kth
occurrence. The probability distribution of $S_n$ is obtained in the customary
way, adding the rule “infinity plus any value equals infinity.” (This
is possible since we are adding positive variables and no subtractions
occur.)

The probability that ℰ does not occur at all is 1 - f. The probability
that it occurs once and only once is (because of the independence of
recurrence times) f(1 - f). Proceeding in this way, we find $f^n(1 - f)$
for the probability that ℰ occurs exactly n times. The probability
that ℰ occurs at most n times is therefore

(2.6)   $(1 - f)(1 + f + \cdots + f^n) = 1 - f^{n+1}.$

As $n \to \infty$ the right side tends to 1. This fact can be described by
saying that with probability one an uncertain recurrent event occurs only a
finite number of times, while a certain ℰ is bound to occur infinitely often.
The first half of this statement is the correct interpretation of (2.6)
in the sample space of infinitely many trials.

3. Fundamental Theorems 

We keep the notation $f_j$ for the probability that ℰ occurs for the
first time at trial number j. At the same time we shall study the
probability $u_j$ that ℰ occurs at the jth trial (not necessarily for the first
time). The quantities $u_n$ and $f_n$ are related by the fundamental
equation

(3.1)   $u_n = f_n + f_{n-1}u_1 + f_{n-2}u_2 + \cdots + f_2 u_{n-2} + f_1 u_{n-1}.$

In fact, if ℰ occurs at the nth trial, then either it occurs for the first
time (probability $f_n$), or it occurred at some previous trial. The
probability that ℰ occurs for the first time at trial number $n - \nu$ and
again at the nth trial is obviously $f_{n-\nu}u_\nu$. These events are mutually
exclusive, and therefore (3.1) holds.

Equation (3.1) is a recurrence relation from which the $u_n$ can be
calculated successively if the $f_j$ are known and vice versa. It assumes
a more symmetric form if we agree to define

(3.2)   $u_0 = 1, \qquad f_0 = 0.$

With this definition (3.1) becomes

(3.3)   $u_n = u_0 f_n + u_1 f_{n-1} + u_2 f_{n-2} + \cdots + u_{n-1} f_1,$

but this equation is valid only for $n \ge 1$. The right side is the
convolution of the two sequences $\{f_j\}$ and $\{u_j\}$, and this suggests introducing
the generating functions

(3.4)   $U(s) = \sum_{j=0}^{\infty} u_j s^j, \qquad F(s) = \sum_{j=1}^{\infty} f_j s^j.$

The generating function of the right side in (3.3) is U(s)F(s). (Cf.
Chapter 11, section 2.) On the left we have the sequence $\{u_n\}$ with
the term $u_0 = 1$ missing, so that the generating function of the left
side is U(s) - 1. Thus U(s) - 1 = U(s)F(s), and we have the

Theorem 1. For $k \ge 1$ let $f_k$ be the probability that ℰ occurs for the
first time at the kth trial, and $u_k$ the probability that ℰ occurs at the kth
trial; for k = 0 put $f_0 = 0$ and $u_0 = 1$. The generating functions (3.4)
are then related by the identity

(3.5)   $U(s) = \frac{1}{1 - F(s)}.$
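
The recurrence relation (3.1) translates directly into a computation. The following Python sketch (function names chosen here for illustration) solves (3.1) in both directions with exact fractions, using the conventions $u_0 = 1$, $f_0 = 0$; as a check it starts from the coin-tossing values $u_{2n} = \binom{2n}{n}2^{-2n}$ of example (1.a).

```python
from fractions import Fraction
from math import comb

def first_occurrence_probs(u):
    """Solve (3.1) for f_0..f_N given u_0..u_N (with u_0 = 1); f_0 = 0."""
    n_max = len(u) - 1
    f = [Fraction(0)] * (n_max + 1)
    for n in range(1, n_max + 1):
        f[n] = u[n] - sum(f[n - v] * u[v] for v in range(1, n))
    return f

def occurrence_probs(f):
    """Solve (3.1) for u_0..u_N given f_0..f_N (with f_0 = 0); u_0 = 1."""
    n_max = len(f) - 1
    u = [Fraction(1)] + [Fraction(0)] * n_max
    for n in range(1, n_max + 1):
        u[n] = sum(f[n - v] * u[v] for v in range(0, n))
    return u

if __name__ == "__main__":
    # Return to equilibrium in coin tossing: u_n = C(n, n/2) 2^(-n) for even n, 0 for odd n.
    u = [Fraction(comb(n, n // 2), 2**n) if n % 2 == 0 else Fraction(0)
         for n in range(15)]
    f = first_occurrence_probs(u)
    print([str(f[n]) for n in range(2, 15, 2)])
```

For this input the recursion should reproduce the values $f_2, f_4, \ldots, f_{14}$ listed in example (1.a).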





Theorem 2. The recurrent event ℰ is uncertain if and only if

(3.6)   $u = \sum_{j=0}^{\infty} u_j$

is finite. In this case the probability f that ℰ ever occurs is given by

(3.7)   $f = \frac{u - 1}{u}.$

Note. We can interpret $u_j$ as the expectation of a random variable
which equals 1 or 0 according to whether ℰ does or does not occur at
the jth trial. Hence $u_1 + u_2 + \cdots + u_n$ is the expected number of
occurrences of ℰ in n trials, and u - 1 can be interpreted as the
expected number of occurrences of ℰ in infinitely many trials.

Proof. If ℰ is uncertain, then the series for F(s) converges at s = 1
and F(1) = f < 1. Now the series for U(s) has only non-negative
coefficients, and since U(s) approaches 1/(1 - f) as $s \to 1$ it follows
from Abel's theorem that the series converges for s = 1 and U(1)
= 1/(1 - f), which is equivalent to (3.7). Conversely, if ℰ is certain,
then $F(s) \to 1$ as $s \to 1$, and hence $U(s) \to \infty$ so that $\sum u_j$ diverges.

We now come to the main result of the theory from which we shall
derive, among others, the ergodic properties of Markov chains. The
proof^1 is of an elementary nature, but since it does not contribute to
an understanding of the applications of the theorem we defer it to
section 7 (cf. also problem 1).

Theorem 3. If ℰ is a certain recurrent event and not periodic, then

(3.8)   $u_n \to \frac{1}{\mu}$

where $\mu = \sum n f_n$ is the mean recurrence time ($u_n \to 0$ if the mean
recurrence time is infinite).
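
Theorem 3 can be observed numerically. In the following Python sketch the distribution $f_1 = 1/2$, $f_2 = f_3 = 1/4$ is a hypothetical aperiodic example chosen here (it is not taken from the text); its mean recurrence time is μ = 7/4, so that $u_n$ should approach 4/7.

```python
def occurrence_probabilities(f, n_max):
    """u_0..u_{n_max} from u_n = f_1 u_{n-1} + ... + f_n u_0, with u_0 = 1.

    `f` maps trial numbers j >= 1 to f_j; missing keys mean f_j = 0.
    """
    u = [1.0]
    for n in range(1, n_max + 1):
        u.append(sum(fj * u[n - j] for j, fj in f.items() if j <= n))
    return u

if __name__ == "__main__":
    f = {1: 0.5, 2: 0.25, 3: 0.25}            # hypothetical aperiodic distribution
    mu = sum(j * fj for j, fj in f.items())   # mean recurrence time, here 1.75
    u = occurrence_probabilities(f, 60)
    print(u[60], 1.0 / mu)
```

For a distribution with finitely many non-zero terms, as here, the convergence is geometrically fast; the theorem itself makes no such claim in general.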

Theorem 4. If ℰ is certain and has period $\lambda > 1$, then as $n \to \infty$

(3.9)   $u_{n\lambda} \to \frac{\lambda}{\mu},$

while $u_k = 0$ for every k which is not divisible by $\lambda$.

Proof. If ℰ has period $\lambda$, then only powers of $s^\lambda$ appear in F(s), and
$F(s^{1/\lambda}) = F_1(s)$ is again a power series. Since $F_1(1) = 1$, we may
consider $F_1(s)$ as the generating function of a recurrent event to
which theorem 3 applies. It follows that the coefficients of $U_1(s)
= 1/\{1 - F_1(s)\}$ tend to $1/\mu_1$, where

(3.10)  $\mu_1 = F_1'(1) = \frac{1}{\lambda}F'(1) = \frac{\mu}{\lambda}.$

(Clearly $\mu$ and $\mu_1$ are either both finite or both infinite.) However,
the coefficient of $s^n$ in $U_1(s)$ is the coefficient of $s^{n\lambda}$ in U(s), and this
proves (3.9).

1 P. Erdos, W. Feller, and H. Pollard, A Theorem on Power Series, Bulletin of
the American Mathematical Society, vol. 55, pp. 201-204 (1949).

Examples. (a) Consider a sequence of Bernoulli trials and let ℰ
stand for “success.” Then $u_n = p$ when $n \ge 1$. Theorem 3 states
that the mean recurrence time for successes is $\mu = 1/p$. This can be
verified directly. The probability that the first success occurs at the
nth trial is $f_n = q^{n-1}p$, and the mean of this geometric distribution
already has been found to be 1/p. In order to verify the fundamental
identity (3.5) note that we have $F(s) = ps(1 + qs + q^2s^2 + \cdots)
= ps/(1 - qs)$. Moreover $u_0 = 1$ and $u_n = p$ for $n \ge 1$, so that
$U(s) = 1 + p(s + s^2 + s^3 + \cdots) = 1 + ps/(1 - s)$. Hence
$U(s) = (1 - qs)/(1 - s) = 1/(1 - F(s))$.

(b) The Classical Gambling; Return to Equilibrium. The
time-honored gambler of probability textbooks wins or loses a unit amount
with probabilities p and q, respectively. Let ℰ stand for “the
accumulated net gain is zero,” so that the gambler is back at the initial
position. In our terminology we are concerned with Bernoulli trials, and ℰ
stands for “the accumulated numbers of successes and failures are
equal.” Clearly ℰ is a recurrent event [cf. example (a) of section 1
where p = q]. Since this recurrent event can occur only at an
even-numbered trial, it has period 2. For ℰ to occur at trial number 2n it
is necessary and sufficient that the 2n trials result in n successes and n
failures. Hence

(3.11)  $u_{2n} = \binom{2n}{n}p^n q^n.$

We know from the normal approximation to the binomial distribution
(Chapter 7), and we can also readily verify using Stirling’s formula, that

(3.12)  $\binom{2n}{n}2^{-2n} \sim \frac{1}{(\pi n)^{1/2}},$

so that

(3.13)  $u_{2n} \sim \frac{(4pq)^n}{(\pi n)^{1/2}}.$

If $p \ne 1/2$, then 4pq < 1 and the series $\sum u_{2n}$ converges faster than
the geometric series with ratio 4pq. Therefore, if $p \ne 1/2$, the return
to equilibrium is uncertain. If p = q = 1/2, then $u_{2n} \sim (\pi n)^{-1/2}$; the
series $\sum u_{2n}$ diverges, but $u_{2n} \to 0$. Therefore, for p = q = 1/2 the
return to equilibrium is a certain recurrent event, but its mean recurrence
time is infinite.

We can get even more information from the generating functions.
Using the readily verified formula

(3.14)  $\binom{2n}{n} = \binom{-1/2}{n}(-4)^n$

(cf. Chapter 2, problem 2 of section 9) and the binomial series (Chapter
2, 6.6), we get from (3.11)

(3.15)  $U(s) = \sum_{n=0}^{\infty} u_{2n}s^{2n} = (1 - 4pqs^2)^{-1/2}.$

If $p \ne 1/2$, then $u = U(1) = (1 - 4pq)^{-1/2} = |p - q|^{-1}$. From (3.7)
we conclude that the probability f that the accumulated numbers of
successes and failures will ever equalize is given by

(3.16)  $f = 1 - |p - q|.$

(This is the probability of at least one return to equilibrium.)
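
Formula (3.16) can be compared with a direct summation of the series (3.6). The Python sketch below is an illustration only; the function name and the truncation at 2000 terms are choices made here, and the terms are generated from the ratio $u_{2n}/u_{2n-2} = 2(2n - 1)pq/n$, which follows from (3.11).

```python
def return_probability(p: float, terms: int = 2000) -> float:
    """Approximate f = (u - 1)/u of (3.7), summing u_{2n} = C(2n, n) p^n q^n for n <= terms."""
    q = 1.0 - p
    u_2n = 1.0   # u_0
    u = 1.0      # running partial sum of the series (3.6)
    for n in range(1, terms + 1):
        u_2n *= 2.0 * (2 * n - 1) / n * p * q   # ratio u_{2n} / u_{2n-2}
        u += u_2n
    return (u - 1.0) / u

if __name__ == "__main__":
    for p in (0.3, 0.45, 0.6):
        print(p, return_probability(p), 1.0 - abs(2.0 * p - 1.0))
```

For p close to 1/2 the series converges slowly and more terms are needed; for p = 1/2 it diverges and the partial results creep toward 1.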

From (3.5) we get for the generating function of the recurrence times

(3.17)  $F(s) = 1 - (1 - 4pqs^2)^{1/2}.$

This formula is most interesting in the case p = q = 1/2. Then

(3.18)  $F(s) = 1 - (1 - s^2)^{1/2}$

and the binomial expansion shows that

(3.19)  $f_{2n} = (-1)^{n+1}\binom{1/2}{n} = \frac{1}{n}\binom{2n - 2}{n - 1}2^{-2n+1}$

($f_n$ vanishes whenever n is odd). Equation (3.19) gives the distribution
of the recurrence times for the return to equilibrium in the classical
coin-tossing game. From (3.18) it follows again that the expected value
of this recurrence time is infinite. A few surprising features of these
recurrence times will be discussed in section 5. Numerical values for
the first seven $f_{2n}$ are given at the end of example (1.a).

(c) Ties in Multiple Coin Games. We consider repeated independent
tossings of two coins and say that $\mathcal{E}$ has occurred whenever the
accumulated number of heads (and therefore of tails) is the same for both
coins. Clearly

(3.20)  $u_n = 2^{-2n} \sum_{k=0}^{n} \binom{n}{k}^2.$

Using formula (9.8) of Chapter 2 and then the normal approximation
to the binomial distribution, we find that

(3.21)  $u_n = \binom{2n}{n} 2^{-2n} \sim \dfrac{1}{(\pi n)^{1/2}}.$

Hence $\sum u_n$ diverges, but $u_n \to 0$. Therefore $\mathcal{E}$ is certain but has infinite
mean recurrence time.

Let us now more generally consider the simultaneous tossing of $r$
coins, and let $\mathcal{E}$ stand for the recurrent event that all $r$ coins are in the
same phase (accumulated numbers of heads are the same for all coins).
Then

(3.22)  $u_n = 2^{-rn} \sum_{k=0}^{n} \binom{n}{k}^r.$

To estimate $u_n$ note that the maximal term of the binomial distribution
$\binom{n}{k} 2^{-n}$ is of the order of magnitude $(2/\pi n)^{1/2}$ and is smaller than $n^{-1/2}$.
Therefore

(3.23)  $u_n < n^{-(r-1)/2}\, 2^{-n} \left\{ \binom{n}{0} + \binom{n}{1} + \cdots + \binom{n}{n} \right\} = n^{-(r-1)/2}.$

Accordingly $\sum u_n$ converges if $r \ge 4$. For $r = 2$ we saw that $\sum u_n$
diverges. A special consideration is necessary for the case $r = 3$.
From the normal approximation to the binomial distribution we know
that for sufficiently large $n$ and values of $k$ lying between $\tfrac{1}{2}n - n^{1/2}$ and
$\tfrac{1}{2}n + n^{1/2}$ we have $\binom{n}{k} 2^{-n} > c n^{-1/2}$, where $c$ is a positive constant (say
$e^{-4}$). Therefore, when $r = 3$,

(3.24)  $u_n > 2n^{1/2} (c^3 n^{-3/2}) = 2c^3/n,$

and hence $\sum u_n$ diverges. In other words, with two or three coins it is
certain that they will sooner or later (and hence arbitrarily often) show
the same accumulated number of heads. For four or more coins the same
recurrent event is uncertain.
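The contrast between $r = 2, 3$ (divergence) and $r \ge 4$ (convergence) can be seen numerically. The sketch below assumes $u_n$ is the sum of the $r$th powers of the symmetric binomial probabilities, as described in the text; the function names and the cut-off values are illustrative choices, and a finite partial sum can of course only suggest, not prove, divergence.

```python
from math import comb


def u_same_phase(n: int, r: int) -> float:
    """Probability that r fair coins show equal head counts after n tosses each."""
    return sum((comb(n, k) / 2 ** n) ** r for k in range(n + 1))


def partial_sum(r: int, n_max: int) -> float:
    """u_1 + u_2 + ... + u_{n_max} for r coins."""
    return sum(u_same_phase(n, r) for n in range(1, n_max + 1))


if __name__ == "__main__":
    for r in (2, 3, 4, 5):
        # For r = 2 and r = 3 the partial sums keep growing;
        # for r >= 4 they level off.
        print(r, [round(partial_sum(r, m), 4) for m in (50, 100, 200, 400)])
```

For $r = 3$ the growth is only logarithmic, in line with the lower bound of order $1/n$ obtained in the text, so it is visible only over a wide range of cut-offs.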

(d) Dice. In example (1.c) we considered the recurrent event $\mathcal{E}$
that the accumulated numbers of aces, twos, threes, etc., are equal.
Obviously $\mathcal{E}$ has period 6 and $u_{6n} = (6n)!\,(n!)^{-6}\,6^{-6n}$. Using Stirling’s
formula, it is readily found that $u_{6n}$ is of the order of magnitude $n^{-5/2}$,
so that $\sum u_n$ converges. Hence $\mathcal{E}$ is uncertain. From (3.7) it is easy to
calculate that the probability of a recurrence is about 0.022.
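The value 0.022 can be reproduced by summing the series directly. This is a minimal sketch under the assumption that (3.7) reads $f = (u - 1)/u$ with $u = \sum u_n$; the function names and the truncation point are illustrative.

```python
from math import factorial


def u_dice(n: int) -> float:
    """u_{6n} = (6n)! (n!)^{-6} 6^{-6n}: all six faces have appeared n times."""
    return factorial(6 * n) / (factorial(n) ** 6 * 6 ** (6 * n))


def recurrence_probability(terms: int = 200) -> float:
    """f = (u - 1)/u with u = 1 + u_6 + u_12 + ..., truncated after `terms` terms."""
    u = 1.0 + sum(u_dice(n) for n in range(1, terms + 1))
    return (u - 1.0) / u


if __name__ == "__main__":
    print(u_dice(1))                  # 720 / 46656
    print(recurrence_probability())   # compare with the value quoted in the text
```

Since the terms decrease like $n^{-5/2}$, the neglected tail after 200 terms is far below the precision of the quoted figure.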


4. Application of the Central Limit Theorem 

In (2.3) we have defined the random variable $S_n$ giving the number
of trials up to and including the $n$th occurrence of $\mathcal{E}$. Often it is more
convenient to fix the number of trials and consider the number of
occurrences of $\mathcal{E}$ as a random variable. This leads to defining the
variable $N_r$ as the number of occurrences of $\mathcal{E}$ in $r$ trials. The probability
distributions of $S_n$ and $N_r$ are related by the obvious identity:

(4.1)  $\Pr\{N_r \ge k\} = \Pr\{S_k \le r\}.$

Now let us suppose that $\mathcal{E}$ is certain and that the distribution $\{f_n\}$
has finite mean $\mu$ and finite variance $\sigma^2$. Then by the central limit
theorem (Chapter 10, section 1) for every fixed $\alpha$

(4.2)  $\Pr\left\{ \dfrac{S_k - k\mu}{\sigma k^{1/2}} < \alpha \right\} \to \Phi(\alpha),$

where $\Phi(x)$ is the normal distribution function. Using (4.1), we can
reformulate (4.2) as a limit theorem for $N_r$. If we let $k$ and $r$ both
tend to infinity so that²

(4.3)  $r - k\mu \sim \alpha \sigma k^{1/2},$

then

(4.4)  $\Pr\{N_r \ge k\} \to \Phi(\alpha).$

To write this relation in a more familiar form we introduce the reduced
variable $N_r^*$ defined by

(4.5)  $N_r^* = \dfrac{N_r - r/\mu}{\sigma \mu^{-3/2} r^{1/2}}.$

The inequality $N_r \ge k$ is now the same as

(4.6)  $N_r^* \ge \dfrac{k - r/\mu}{\sigma \mu^{-3/2} r^{1/2}}.$

Using (4.3), the right side is seen to be asymptotically equal to $-\alpha$,
and hence (4.4) can be written in the form

(4.7)  $\Pr\{N_r^* \ge -\alpha\} \to \Phi(\alpha)$

or

(4.8)  $\Pr\{N_r^* < -\alpha\} \to 1 - \Phi(\alpha).$

² In this book the sign ~ indicates that the ratio of the two sides tends to unity.



We have thus proved the

Theorem. If the recurrent event $\mathcal{E}$ is certain and its recurrence times
have finite mean $\mu$ and finite variance $\sigma^2$, then the number $S_k$ of trials
up to the $k$th occurrence of $\mathcal{E}$ and the number $N_r$ of occurrences of $\mathcal{E}$ in the
first $r$ trials are asymptotically normally distributed as indicated by the
relations (4.2) and (4.8).

The limit theorem (4.8) makes it plausible that

(4.9)  $E(N_r) \sim \dfrac{r}{\mu}, \qquad \mathrm{Var}(N_r) \sim \dfrac{r\sigma^2}{\mu^3},$

but an exact proof requires additional arguments. (Cf. problems 7
and 11.) Note, incidentally, that (4.8) is an example of the central
limit theorem applied to a sequence of variables other than sums of
independent variables.

The usefulness of the above theorem will be illustrated in Chapter 13,
where it will be applied to the theory of runs. In deriving it, we have
assumed that the recurrence times have a finite variance $\sigma^2$. This
restriction at first appears mild, but it turns out that the most interesting
recurrence times in various physical processes have infinite mean
and variance. In this case formulas (4.2) and (4.8) become meaningless
and must be replaced by more general limit theorems.³ In the
following section we discuss a typical example which will reveal many
surprising features. It will be seen that the types of fluctuations in
the cases of finite and infinite mean recurrence times are not at all
similar.

5. Fluctuations in the Coin-tossing Game; the Arc Sine Law

We now discuss two interesting theorems which have important 
conceptual and statistical implications. They will reveal the fact that 
widespread beliefs concerning chance fluctuations are based on misconceptions
and may lead to dangerous fallacies. We shall here discuss
only the coin-tossing game, but our results are typical of a wide class 
of fluctuation phenomena, and the two limit theorems are valid under 
much more general conditions. 

Suppose that in a coin-tossing game with unit stakes Peter bets on
heads and Paul on tails. Denote Peter's accumulated gain at the $n$th
trial by $S_n$, so that Paul’s accumulated gain is $-S_n$. If $n$ is odd, then
one of the two gains is positive and the other negative, so that one
of the players is in the lead. If $n$ is even, then it is possible that $S_n = 0$,
in which case we speak of a tie. The frequency of such ties is so small
that they play no role in the following considerations. However, a
constant reference to ties renders the exposition clumsy, and it is
therefore desirable to eliminate them from our language. Accordingly,
in the case of a tie we agree to say that Peter leads if he has led in the
preceding trial. In other words, Peter leads if either $S_n > 0$ or $S_n = 0$
but $S_{n-1} > 0$. With this convention, periods of Peter’s lead will
alternate with periods of Paul’s lead, and the lead passes from one
player to the other whenever $S_n$ changes sign. Note that, if Peter leads
at an odd trial, he automatically leads at the following trial.

³ W. Feller, Fluctuation Theory of Recurrent Events, Transactions of the American
Mathematical Society, vol. 67, pp. 98-119 (1949).

To fix ideas, let the coin be tossed n = 20 times. The number r of 
trials at which Peter leads may be any even number between 0 and 20. 


TABLE 1

Distribution of Leads in 20 Trials

    r      a_r       b_r
    0    0.1762    0.3524
    2     .0927     .5379
    4     .0736     .6851
    6     .0655     .8160
    8     .0617     .9394
   10     .0606
   12     .0617
   14     .0655
   16     .0736
   18     .0927
   20     .1762

A coin is tossed twenty times; $a_r$ is the probability that Peter leads in exactly $r$
trials; $b_r$ is the probability that the less fortunate player leads in at most $r$ trials.


One feels intuitively that 10 ought to be the most probable number, 
but this is not so. On the contrary, Table 1 shows that 10 is the least 
probable number and that the extreme tails r = 0 and r = 20 are most 
probable. In more than one-third of all cases we must expect that one 
of the two players will remain in the lead throughout the game. 
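The entries of Table 1 can be recomputed from formula (5.1), which is stated in Theorem 1 later in this section. The following is a minimal sketch with illustrative function names; the doubling used for $b_r$ relies on the symmetry between the two players.

```python
from math import comb


def lead_probability(r: int, n: int) -> float:
    """p_{2r,2n} of (5.1): Peter leads in exactly 2r out of 2n trials."""
    return comb(2 * r, r) * comb(2 * n - 2 * r, n - r) / 4 ** n


def table_1(n: int = 10):
    """Columns of Table 1 for 2n trials (n = 10 gives twenty tosses).

    a[r] is the probability that Peter leads in exactly 2r trials.
    b[r] is the probability that the less fortunate player leads in at most
    2r trials, for 2r below the midpoint; by symmetry it is twice the lower tail.
    """
    a = [lead_probability(r, n) for r in range(n + 1)]
    b = [2 * sum(a[: r + 1]) for r in range(n // 2)]
    return a, b


if __name__ == "__main__":
    a, b = table_1()
    for r, value in enumerate(a):
        print(2 * r, round(value, 4), round(b[r], 4) if r < len(b) else "")
```

Running the sketch shows the U-shape directly: the smallest entry sits in the middle and the largest at the two ends.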

This result is very surprising, but a similar statement is true for an
arbitrary number of trials. If Peter leads in $r$ out of $n$ trials, we call
$r/n$ the fraction of time during which Peter is in the lead. Table 2
illustrates the general situation for a large number of trials. It is
supposed that a coin is tossed once per second for a total of 1 year, and
the fractions of time during which the two players are in the lead are
denoted by $Z$ and $1 - Z$. For definiteness let $Z$ be the smaller of the
two fractions so that $Z$ lies in the interval from 0 to 1/2. Table 2 gives
the values $x$ such that the relation $Z < x$ has probabilities 0.9, 0.8, etc.
The following features are noteworthy.

The probability that the less fortunate player leads for more than
5 months is less than 1/10, and the probability that he leads for a total
of less than 35 days exceeds 0.4. Turning to the significance levels
commonly used by statisticians, we note that in one out of twenty
cases the less fortunate player will lead for less than 13½ hours, and
in one out of a hundred cases for a total of 32 minutes. It is difficult
to imagine a player who would find himself in the red for 364 out of
365 days without complaining, and yet this should occur in about one
out of sixteen cases. Few people would believe that a true coin will
lead to preposterous sequences where $S_n$ does not change sign in millions
of trials, and yet this should occur rather frequently. We have here
the simplest example of the behavior of chance fluctuations in processes
depending on cumulative chance effects. Processes of this kind
are usual in economical and sociological phenomena, and our findings
should serve as a warning to those who easily discover “obvious”
secular trends or deviations from average norms.

TABLE 2

Illustrating the Arc Sine Law

     p        t_p
    0.9    153.95 days
     .8    126.10 days
     .7     99.65 days
     .6     75.23 days
     .5     53.45 days
     .4     34.85 days
     .3     19.89 days
     .2      8.93 days
     .1      2.24 days
     .05    13.5 hours
     .02     2.16 hours
     .01    32.4 minutes

A coin is tossed once per second for a total of 365 days; let $Z$ be the fraction of
time during which the less fortunate player is in the lead. Then $t_p$ is the interval
such that the event $Z < t_p$ has probability $p$, approximately.




We now give the formulas from which the two tables are constructed 
but postpone the proofs to the next section. 

Theorem 1.⁴ The probability that Peter leads in $2r$ out of $2n$ trials is

(5.1)  $p_{2r,2n} = \binom{2r}{r} \binom{2n-2r}{n-r} 2^{-2n}.$

Here $r/n = Z$ is the fraction of time during which Peter leads. The
probability that $Z \le k/n$ is obviously $p_{0,2n} + p_{2,2n} + \cdots + p_{2k,2n}$.
From theorem 1 we shall derive

Theorem 2. The Arc Sine Law.⁵ Let $n$ be large, and let $Z$ be the fraction
of time during which Peter leads. Then the probability that $Z < t$ is
given approximately by

(5.2)  $f(t) = \dfrac{2}{\pi} \arcsin t^{1/2},$

with the error tending to zero as $n \to \infty$.

Here $t$ may be any fixed number in the interval (0, 1). The fraction
of time during which the less fortunate player leads lies necessarily in
the interval (0, 1/2), and the corresponding probabilities are obtained
by doubling $f(t)$. Figure 7 illustrates the arc sine law; Table 2 was
computed from it by converting the fraction $t$ to days. For a surprising
counterpart to the arc sine law see problem 4.
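The conversion from (5.2) to the entries of Table 2 can be sketched as follows. The function names are illustrative; the inversion uses the fact that doubling $f(t)$ and setting the result equal to $p$ gives $t = \sin^2(\pi p/4)$.

```python
from math import asin, pi, sin, sqrt


def arc_sine_cdf(t: float) -> float:
    """f(t) of (5.2), for 0 <= t <= 1."""
    return 2.0 / pi * asin(sqrt(t))


def less_fortunate_quantile(p: float) -> float:
    """Fraction t in [0, 1/2] with 2 f(t) = p, that is, t = sin^2(pi p / 4)."""
    return sin(pi * p / 4.0) ** 2


def days(p: float, total_days: float = 365.0) -> float:
    """Lead time t_p of the less fortunate player, expressed in days."""
    return total_days * less_fortunate_quantile(p)


if __name__ == "__main__":
    for p in (0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1):
        print(p, round(days(p), 2), "days")
    print(0.05, round(days(0.05) * 24, 2), "hours")
    print(0.01, round(days(0.01) * 24 * 60, 1), "minutes")
```

The printed values may differ from the table in the last digit, since the table entries appear to have been rounded by hand.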

The following theorem throws light on the same situation from 
another point of view. The explanation of the arc sine law lies in the 
fact that an enormous number of trials are frequently required before 
the numbers of heads and tails equalize. This statement is rendered 
more precise by the following theorem, in which $\Phi(x)$ is the normal
distribution function.

⁴ K. L. Chung and W. Feller, On Fluctuations in Coin-tossing, Proceedings of
the National Academy of Sciences, vol. 35 (1949), pp. 605-608. This theorem was
suggested by the discovery of E. Sparre Andersen that (5.1) gives the probability
of exactly $2r$ changes of sign in the sequence of the first $2n$ partial sums
$X_1 + \cdots + X_k$, where the $X_k$ are mutually independent random variables with a
common continuous distribution (cf. On the Number of Positive Sums of Random
Variables, Skandinavisk Aktuarietidskrift, 1950). The formula is not true for general
discrete variables. Our derivation leads also to a formula for $p_{k,2n+1}$.

⁵ Paul Lévy [Sur certains processus stochastiques homogènes, Compositio
Mathematica, vol. 7 (1939), pp. 283-339] found the arc sine law for certain continuous
diffusion processes and referred to the connection with the coin-tossing game. A
general arc sine law for the number of positive partial sums in a sequence of mutually
independent random variables was proved by P. Erdős and M. Kac, On the Number
of Positive Sums of Independent Random Variables, Bulletin of the American
Mathematical Society, vol. 53 (1947), pp. 1011-1020.






Theorem 3.⁶ Consider $k$ independent coin-tossing games each of which
is continued until for the first time the accumulated numbers of heads and
tails are equal. Let $Y_1, Y_2, \ldots, Y_k$ be the durations of these $k$ games and

(5.3)  $\bar{Y}_k = \dfrac{Y_1 + Y_2 + \cdots + Y_k}{k}$

their average. Then for every fixed positive $\alpha$ we have approximately

(5.4)  $\Pr\{\bar{Y}_k < k\alpha\} \approx 2\{1 - \Phi(\alpha^{-1/2})\},$

and the difference of the two sides tends to zero as $k \to \infty$.
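The right side of (5.4) is easy to evaluate, and for $k = 1$ it can be set beside the exact distribution of the first return. The following is a minimal sketch with illustrative names; it expresses the normal distribution function through `math.erf`, finds the value of $\alpha$ at which the approximation equals 1/2 by bisection, and takes the exact first-return probabilities from (3.19).

```python
from math import comb, erf, sqrt


def normal_cdf(x: float) -> float:
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def approx_54(alpha: float) -> float:
    """Right side of (5.4): 2 {1 - Phi(alpha^(-1/2))}."""
    return 2.0 * (1.0 - normal_cdf(alpha ** -0.5))


def exact_first_return_cdf(n: int) -> float:
    """f_2 + f_4 + ... + f_{2n}, with f_{2k} taken from (3.19)."""
    return sum(comb(2 * k - 2, k - 1) / k * 2.0 ** (-2 * k + 1) for k in range(1, n + 1))


def median_alpha(lo: float = 0.1, hi: float = 100.0) -> float:
    """Value of alpha at which approx_54 equals 1/2 (bisection; approx_54 is increasing)."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if approx_54(mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    print(median_alpha())
    for n in range(1, 11):
        # exact distribution of the first return against (5.4) with k = 1, alpha = 2n
        print(n, round(exact_first_return_cdf(n), 4), round(approx_54(2 * n), 4))
```

The bisection bracket (0.1, 100) is an arbitrary choice that happens to contain the root.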

The surprising feature of this theorem is the following. In the
language of the theory of observations $Y_1, \ldots, Y_k$ are $k$ independent
measurements on the same physical quantity [namely, the recurrence
time of the return to equilibrium as discussed in examples (1.a) and
(3.b)]. The standard theories would lead us to expect that the average
$\bar{Y}_k$ will, with increasing $k$, settle down to a “true” value, which would
be called the mean value of the recurrence time. In our case the mean
recurrence time is infinite [example (3.b)], and therefore $\bar{Y}_k$ increases
over all bounds. The surprising feature is that $\bar{Y}_k$ has a probable
order of magnitude $k$, so that the sum $Y_1 + \cdots + Y_k$ increases like
$k^2$. In fact, the value $\alpha$ for which the right side of (5.4) equals 1/2 is
$\alpha = 2.198\ldots$. We conclude that each of the two relations
$\bar{Y}_k < 2.198k$ and $\bar{Y}_k > 2.198k$ has a probability near 1/2. In 10
observations the average duration of the game is as likely as not to
exceed 22; in 100 observations the same average is equally likely to
fall short of or exceed 220; and in 1000 observations the average has
about even chances to exceed 2200. This phenomenon contrasts
sharply with the familiar theory of the “stability” of the means of a
large number of “good” measurements. The unorthodox behavior of
our averages $\bar{Y}_k$ is due to the fact that in $k$ measurements of our
recurrence times one or two are likely to be overwhelmingly large as
compared with all others. This fact reveals the danger in the practice of
many experimentalists of “throwing away” excessively large observations.
With this frequently advocated method the physicist is bound
to obtain spurious finite values for recurrence times even though they
have infinite means. Note also that the St. Petersburg game discussed
in Chapter 10 is closely related to the present considerations. Fortunately
we are able to replace the classical paradoxes by limit theorems
which can be checked experimentally.

TABLE 3

Probability Distributions of the Averages $\bar{Y}_k$ of Theorem 3

     p       a_1       a_10      a_100
    0.5        3        23        221
     .4        5        37        365
     .3        7        69        675
     .2       15       157      1,559
     .1       63       635      6,337
     .05     255     2,545     25,441
     .01   6,359    63,593    635,930

$a_k$ is the number of trials for which $\Pr\{\bar{Y}_k > a_k\}$ comes nearest to the probability
$p$ given in the first column.

TABLE 4

Distribution of the First Return to Equilibrium Together with
Approximation (5.4)

     n    Pr{Y_1 ≤ 2n} = f_2 + f_4 + ... + f_2n    2{1 - Φ((2n)^(-1/2))}
     1                 0.5                                0.4795
     2                  .625                               .6170
     3                  .6875                              .6831
     4                  .7266                              .7237
     5                  .7539                              .7518
     6                  .7744                              .7728
     7                  .7905                              .7893
     8                  .8036                              .8026
     9                  .8145                              .8137
    10                  .8238                              .8231

⁶ In the form of a limit theorem this theorem is due to P. Lévy (cf. the reference
in footnote 5). The present derivation leads to excellent numerical approximations
even when $k = 1$. Theorem 3 is a special case of a class of limit theorems
which replace the central limit theorem in the case where the recurrence times have
no finite variance. (Cf. the reference in footnote 3.)

*6. Proof of the Theorems of Section 5

The following proofs are of methodological interest as they depend 
on double generating functions. Similar ideas are used in the more 
advanced theory, but an understanding of them is not required for 
the present volume. 

Proof of Theorem 1. By $p_{2r,2n}$ we denoted the probability that Peter
leads in exactly $2r$ out of the first $2n$ trials. For $n = 0$ this probability
is not defined, but it is convenient to put $p_{0,0} = 1$. Similarly, we put
$p_{2r,2n} = 0$ if $r > n$. Peter is betting on heads.

Consider first the case $0 < r < n$. The event “Peter leads in $2r$ and
Paul leads in $2n - 2r$ trials” can occur in the following mutually
exclusive ways. (1) The first trial results in tails; the first equalization
of the numbers of heads and tails occurs at the trial number
$2k$ ($k = 1, 2, \ldots, n$). Then Paul leads in the first $2k$ trials, and therefore
Peter leads in exactly $2r$ out of the last $2n - 2k$ trials. (2) The
first trial results in heads, and the first equalization occurs at the trial
number $2k$ ($k = 1, \ldots, r$). Then Peter leads in the first $2k$ trials and
in exactly $2r - 2k$ out of the last $2n - 2k$ trials.

We now compute the corresponding probabilities. The probability
$f_{2k}$ that the first equalization occurs at the trial number $2k$ is given by
formula (3.19); the corresponding generating function is given by
(3.18). Whenever the numbers of heads and tails are equal, the
game starts anew, and therefore the probability of the event (1) is
$\tfrac{1}{2} f_{2k} \cdot p_{2r,2n-2k}$. Similarly, the probability of the event (2) is
$\tfrac{1}{2} f_{2k} \cdot p_{2r-2k,2n-2k}$. Hence

(6.1)  $p_{2r,2n} = \dfrac{1}{2} \sum_{k=1}^{n} f_{2k}\, p_{2r,2n-2k} + \dfrac{1}{2} \sum_{k=1}^{n} f_{2k}\, p_{2r-2k,2n-2k}.$

* Starred sections may be omitted at first reading. 





This argument is not valid when $r = 0$ or $r = n$, since then the
enumeration of the possibilities mentioned under (1) and (2) is not
complete.⁷ The events that Peter never leads and that he leads all
the time can occur also if no equalization of heads and tails occurs
during the first $2n$ trials. Consequently, for $r = 0$ and $r = n$ we have
to add to the right side of (6.1) the quantity

(6.2)  $q_{2n} = \dfrac{1}{2} (f_{2n+2} + f_{2n+4} + f_{2n+6} + \cdots).$

We now introduce the generating functions

(6.3)  $P_{2n}(s) = \sum_{r=0}^{n} p_{2r,2n}\, s^{2r}.$

Multiply (6.1) by $s^{2r}$ and add over all $r$, remembering that for $r = 0$
and $r = n$ the right side should be increased by the quantity (6.2).
We get

(6.4)  $P_{2n}(s) = \dfrac{1}{2} \sum_{k=1}^{n} f_{2k} P_{2n-2k}(s) + \dfrac{1}{2} \sum_{k=1}^{n} f_{2k} s^{2k} P_{2n-2k}(s) + (1 + s^{2n})\, q_{2n}.$
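Before passing to generating functions, the recursion (6.1), with the correction (6.2) at the two ends, can be checked numerically against the closed form (5.1). This is a minimal sketch: the function names are illustrative, terms with impossible indices are taken as zero, and the infinite tail in (6.2) is evaluated as one half of one minus the partial sum of the $f_{2k}$, which assumes that these probabilities add to unity (the return to equilibrium is certain for $p = 1/2$).

```python
from math import comb


def p_lead(r: int, n: int) -> float:
    """p_{2r,2n} from (5.1); zero for impossible indices; p_{0,0} = 1."""
    if r < 0 or r > n:
        return 0.0
    return comb(2 * r, r) * comb(2 * n - 2 * r, n - r) / 4 ** n


def f(k: int) -> float:
    """f_{2k} from (3.19)."""
    return comb(2 * k - 2, k - 1) / k / 2 ** (2 * k - 1)


def q(n: int) -> float:
    """q_{2n} of (6.2), using f_2 + f_4 + ... = 1 to evaluate the tail."""
    return 0.5 * (1.0 - sum(f(k) for k in range(1, n + 1)))


def right_side(r: int, n: int) -> float:
    """Right side of (6.1), increased by (6.2) when r = 0 or r = n."""
    total = 0.5 * sum(f(k) * p_lead(r, n - k) for k in range(1, n + 1))
    total += 0.5 * sum(f(k) * p_lead(r - k, n - k) for k in range(1, n + 1))
    if r in (0, n):
        total += q(n)
    return total


def max_deviation(n_max: int = 12) -> float:
    """Largest gap between (5.1) and the recursion for 1 <= n <= n_max."""
    return max(
        abs(p_lead(r, n) - right_side(r, n))
        for n in range(1, n_max + 1)
        for r in range(n + 1)
    )


if __name__ == "__main__":
    print(max_deviation())  # of the order of floating-point rounding
```

Such a check does not replace the proof, but it guards against index errors in the two sums.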

To solve this equation we introduce the generating function of the
sequence $P_{2n}(s)$, namely,

(6.5)  $P(s, t) = \sum_{n=0}^{\infty} P_{2n}(s)\, t^{2n}.$

Multiply (6.4) by $t^{2n}$ and add over all $n$. To the left we get $P(s, t)$. The
first sum on the right is the convolution of the sequences $\{f_k\}$ (in
which all odd terms vanish) and $\{P_{2n}(s)\}$, so that the generating function
of this sum is the product of $P(s, t)$ and the generating function of
$\{f_k\}$ [the latter is the function $F(t)$ defined in (3.18)]. The same
argument applies to the second sum except that the sequence $\{f_k\}$ is
now replaced by $\{f_k s^k\}$, the generating function of which is clearly
$F(st)$. Finally, the generating function of the sequence $q_{2n}$ is

(6.6)  $\sum_{n=0}^{\infty} q_{2n} t^{2n} = \dfrac{1}{2} \sum_{n=0}^{\infty} t^{2n} (f_{2n+2} + f_{2n+4} + \cdots)$

        $= \dfrac{1}{2} \sum_{k=1}^{\infty} f_{2k} (1 + t^2 + \cdots + t^{2k-2})$

        $= \dfrac{1}{2} \sum_{k=1}^{\infty} f_{2k} \dfrac{1 - t^{2k}}{1 - t^2} = \dfrac{1}{2} \cdot \dfrac{1 - F(t)}{1 - t^2} = \dfrac{1}{2} (1 - t^2)^{-1/2}.$

The generating function of the sequence $\{s^{2n} q_{2n}\}$ is obtained from the
last formula on replacing $t$ by $st$. Hence

(6.7)  $2P(s, t) = \{F(t) + F(st)\} P(s, t) + (1 - t^2)^{-1/2} + (1 - s^2 t^2)^{-1/2}.$

Now $1 - F(t) = (1 - t^2)^{1/2}$, and therefore

(6.8)  $P(s, t) = (1 - t^2)^{-1/2} (1 - s^2 t^2)^{-1/2}.$

The required probability $p_{2r,2n}$ is the coefficient of $s^{2r} t^{2n}$ in $P(s, t)$.
Using the binomial expansion, we find

(6.9)  $p_{2r,2n} = (-1)^n \binom{-1/2}{r} \binom{-1/2}{n-r},$

and this formula reduces to (5.1) [cf. formula (9.2) of Chapter 2]. This
proves theorem 1.

⁷ If $r = 0$ the event (2) cannot occur, but the terms $p_{2r-2k,2n-2k}$ vanish
automatically. Similarly, for $r = n$ we have $p_{2r,2n-2k} = 0$, so that in either case there
remains only one sum in (6.1).

Proof of the Arc Sine Law. To evaluate the probability (5.1) we reduce
the binomial coefficients to factorials and use Stirling’s formula.
We get (as we did in Chapter 7)

(6.10)  $p_{2r,2n} \sim \dfrac{1}{\pi\, r^{1/2} (n - r)^{1/2}},$

where the ratio of the two sides tends to unity as $r \to \infty$ and $n - r \to \infty$.

Now let $0 < \alpha < \beta < 1$ be fixed, and let us evaluate the probability
$z(\alpha, \beta)$ that the fraction of time $r/n$ during which Peter leads lies
between $\alpha$ and $\beta$. For fixed $n$ this probability is obtained by adding
$p_{2r,2n}$ for all $r$ in the interval $\alpha n < r < \beta n$. Hence

(6.11)  $z(\alpha, \beta) \sim \pi^{-1} \sum \{(r/n)(1 - r/n)\}^{-1/2} \cdot \dfrac{1}{n},$

the sum extending over all $r$ in the interval $(\alpha n, \beta n)$. This sum is a
Riemann sum approximating the integral

(6.12)  $\pi^{-1} \int_{\alpha}^{\beta} \dfrac{dx}{\{x(1 - x)\}^{1/2}} = 2\pi^{-1} \{\arcsin \beta^{1/2} - \arcsin \alpha^{1/2}\}.$

We cannot apply this formula to $\alpha = 0$ since (6.10) is not correct for
small $r$. However, it follows from (6.12) that

        $z(0, t) = 2\pi^{-1} \arcsin t^{1/2} + C,$

where $C$ is a constant. For reasons of symmetry we have obviously
$z(0, 1/2) = 1/2$, and hence $C = 0$. This proves (5.2).

Proof of Theorem 3. We introduce the probability distribution of
the sum $Y_1 + \cdots + Y_k$

(6.13)  $q_{k,2n} = \Pr\{Y_1 + \cdots + Y_k = 2n\}.$

Since each of the variables $Y_\nu$ has the generating function $F(s)$ given
by (3.18), the generating function of their sum is

(6.14)  $F^k(s) = \{1 - (1 - s^2)^{1/2}\}^k.$

To get $q_{k,2n}$ we have to find the coefficient of $s^{2n}$. Our first aim is
to prove that⁸

(6.15)  $q_{k,2n} = \dfrac{k}{2n - k} \binom{2n-k}{n} 2^{-2n+k}.$

From the obvious relation $F^k(s) = 2F^{k-1}(s) - s^2 F^{k-2}(s)$ we get the
recursion equations

(6.16)  $q_{k,2n} = 2q_{k-1,2n} - q_{k-2,2n-2}.$

This formula enables us to calculate all $q_{k,2n}$ if the $q_{1,2n}$ and $q_{2,2n}$ are
known. Now $q_{1,2n}$ is the probability that the first return to equilibrium
occurs at the trial number $2n$. Therefore $q_{1,2n} = f_{2n}$ with $f_{2n}$ defined
in (3.19). This formula checks with (6.15) for $k = 1$. Next, $q_{2,2n}$ is
the coefficient of $s^{2n}$ in $F^2(s) = 2F(s) - s^2$, so that $q_{2,2} = 0$ and
$q_{2,2n} = 2q_{1,2n}$ for $n > 1$. Again this formula checks with (6.15). To
prove that (6.15) holds for all $k$ it suffices therefore to show that the
quantities (6.15) satisfy the recurrence relations (6.16), and this is
easily verified. This proves (6.15).
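The explicit distribution of the $k$th return can be compared with a direct $k$-fold convolution of the first-return distribution. The sketch below is illustrative only: it assumes that (6.15) has the form $q_{k,2n} = \frac{k}{2n-k}\binom{2n-k}{n}2^{-2n+k}$ for $n \ge k$ (and zero otherwise), which agrees with (3.19) for $k = 1$; the function names are hypothetical.

```python
from math import comb


def q_formula(k: int, n: int) -> float:
    """Assumed closed form for q_{k,2n}; zero when n < k, since k returns need 2k trials."""
    if n < k:
        return 0.0
    m = 2 * n - k
    return k / m * comb(m, n) / 2 ** m


def q_by_convolution(k: int, n_max: int) -> list[float]:
    """Coefficients of s^0, s^2, ..., s^{2 n_max} in F^k(s), by repeated convolution."""
    f = [0.0] + [comb(2 * n - 2, n - 1) / n / 2 ** (2 * n - 1) for n in range(1, n_max + 1)]
    q = [1.0] + [0.0] * n_max          # F^0(s) = 1
    for _ in range(k):
        q = [sum(q[i] * f[n - i] for i in range(n + 1)) for n in range(n_max + 1)]
    return q


def max_gap(k_max: int = 5, n_max: int = 15) -> float:
    """Largest difference between the closed form and the convolution."""
    return max(
        abs(q_by_convolution(k, n_max)[n] - q_formula(k, n))
        for k in range(1, k_max + 1)
        for n in range(n_max + 1)
    )


if __name__ == "__main__":
    print(q_formula(2, 3), q_by_convolution(2, 3)[3])  # both equal 2 f_2 f_4
    print(max_gap())
```

The convolution is truncated at $s^{2 n_{\max}}$, which is harmless because coefficients of lower powers never depend on higher ones.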

⁸ Note that (6.15) gives an explicit formula for the probability that the $k$th
return to the equilibrium occurs at the $2n$th trial.





We can evaluate (6.15), using the normal approximation to the
binomial distribution. If $n \to \infty$ and $k(2n - k)^{-1/2}$ remains bounded,
then

(6.17)  $\binom{2n-k}{n} 2^{-2n+k} \sim \left\{ \dfrac{2}{\pi(2n - k)} \right\}^{1/2} e^{-k^2/(2(2n-k))},$

and therefore

(6.18)  $q_{k,2n} \sim \left( \dfrac{2}{\pi} \right)^{1/2} \dfrac{k}{(2n - k)^{3/2}}\, e^{-k^2/(2(2n-k))}.$

Here the degree of the approximation increases rapidly as $n \to \infty$.
Using (6.18), we can evaluate the probability

(6.19)  $x_\alpha = \Pr\{\bar{Y}_k > k\alpha\} = \sum_{2n > \alpha k^2} q_{k,2n},$

the summation extending over all $n$ exceeding $\alpha k^2/2$. When $k$ is large
and $n > \alpha k^2/2$, we may in the limit replace $2n - k$ by $2n$ and get from
(6.18) the approximation

(6.20)  $x_\alpha \approx \dfrac{1}{2\pi^{1/2}} \sum_{n > \alpha k^2/2} \dfrac{e^{-k^2/4n}}{(n/k^2)^{3/2}} \cdot \dfrac{1}{k^2}.$

On the right we have a Riemann sum (with $\Delta x = k^{-2}$) approximating
the integral

(6.21)  $\dfrac{1}{2\pi^{1/2}} \int_{\alpha/2}^{\infty} t^{-3/2} e^{-1/(4t)}\, dt,$

which by the change of variables $2t = y^{-2}$ is transformed into
$2\Phi(\alpha^{-1/2}) - 1$. This verifies (5.4). We have proved only that the
difference of the two sides in (5.4) tends to zero as $k \to \infty$. Actually
the approximation is excellent even when $k$ is small. (It can be further
improved by keeping the terms $2n - k$ instead of replacing them by
$2n$. This changes the limits in the integral only slightly.) Table 4
gives the distribution of $Y_1$ together with the approximation (5.4)
when $\alpha$ is an even integer. It will be noticed that even the first term
gives a surprisingly good approximation.

*7. Proof of Theorem 3 of Section 3

In section 3 we have omitted the proof of theorem 3. This theorem
can be formulated either as a “Tauberian” theorem on power series
or in an elementary way as follows. Given a sequence $\{f_n\}$ such that
$f_0 = 0$, $f_n \ge 0$, $\sum f_n = 1$, and that the greatest common divisor of those $n$
for which $f_n > 0$ is one. Let $u_0 = 1$ and define $u_n$ for $n \ge 1$ by

(7.1)  $u_n = f_1 u_{n-1} + f_2 u_{n-2} + \cdots + f_n u_0$

[this is (3.3)]. Then $u_n \to 1/\mu$, where $\mu = \sum n f_n$ (and $u_n \to 0$ if $\sum n f_n$
diverges).

For the proof put

(7.2)  $r_n = f_{n+1} + f_{n+2} + \cdots$

so that [by formula (1.8) of Chapter 11]

(7.3)  $\mu = \sum r_n.$

From (7.2) we get $r_0 = 1$, $f_1 = r_0 - r_1$, $f_2 = r_1 - r_2$, etc. Substituting
these values into (7.1) we find that $r_0 u_n + r_1 u_{n-1} + \cdots
+ r_n u_0 = r_0 u_{n-1} + r_1 u_{n-2} + \cdots + r_{n-1} u_0$. If the left side is called
$A_n$, then the right side is $A_{n-1}$, and our equation states that all $A_n$
are equal. Now $A_0 = r_0 u_0 = 1$, and hence $A_n = 1$ for all $n$. Thus
we have for every $n$

(7.4)  $r_0 u_n + r_1 u_{n-1} + \cdots + r_n u_0 = 1.$

From (7.1) it follows by induction that $u_n \le 1$. Hence there exists
a number $\lambda = \limsup u_n$ such that for any $\epsilon > 0$ and all sufficiently
large $n$ we have $u_n < \lambda + \epsilon$, while there exists some sequence $n_1, n_2,
n_3, \ldots$ such that $u_{n_\nu} \to \lambda$. Choose an integer $j > 0$ such that $f_j > 0$.
We claim that $u_{n_\nu - j} \to \lambda$. If this were not so, we could find arbitrarily
large subscripts $n$ such that simultaneously

(7.5)  $u_n > \lambda - \epsilon, \qquad u_{n-j} < \lambda' < \lambda.$

Now let $N$ be so large that $r_N < \epsilon$. Since $u_k \le 1$, we have then from
(7.1) for $n > N$

(7.6)  $u_n \le f_0 u_n + f_1 u_{n-1} + \cdots + f_N u_{n-N} + \epsilon.$

For sufficiently large $n$ each $u_k$ on the right side is less than $\lambda + \epsilon$, and
$u_{n-j} < \lambda'$. Hence

(7.7)  $u_n < (f_0 + f_1 + \cdots + f_{j-1} + f_{j+1} + \cdots + f_N)(\lambda + \epsilon) + f_j \lambda' + \epsilon$

        $\le (1 - f_j)(\lambda + \epsilon) + f_j \lambda' + \epsilon$

        $< \lambda + 2\epsilon - f_j(\lambda - \lambda').$

If we choose $\epsilon$ so small that $f_j(\lambda - \lambda') > 3\epsilon$, then the last inequality
contradicts the first one in (7.5), so that the assumption $\lambda' < \lambda$ is
impossible.

This proves that, whenever $u_{n_\nu} \to \lambda$, also $u_{n_\nu - j} \to \lambda$. Repeating the
argument, we see: if $f_j > 0$ and $u_{n_\nu} \to \lambda = \limsup u_n$, then also
$u_{n_\nu - j} \to \lambda$, $u_{n_\nu - 2j} \to \lambda$, $u_{n_\nu - 3j} \to \lambda$, etc.

For simplicity let us first consider the case where $f_1 > 0$. Then we
can take $j = 1$ and conclude that $u_{n_\nu - k} \to \lambda$ for every fixed $k$. From
(7.4) we find for $n = n_\nu$

(7.8)  $1 \ge r_0 u_{n_\nu} + r_1 u_{n_\nu - 1} + \cdots + r_N u_{n_\nu - N}.$

For fixed $N$ every $u_{n_\nu - k} \to \lambda$, so that $1 \ge \lambda (r_0 + r_1 + \cdots + r_N)$.
Since $N$ is arbitrary, we conclude that $1 \ge \lambda\mu$ or $\lambda \le 1/\mu$.

Next, let $\gamma = \liminf u_n$. The same argument shows that, for every
sequence $n_\nu$ for which $u_{n_\nu} \to \gamma$, also $u_{n_\nu - k} \to \gamma$. If $N$ is large enough
that $r_N < \epsilon$, then from (7.4)

(7.9)  $1 \le r_0 u_{n_\nu} + \cdots + r_N u_{n_\nu - N} + \epsilon;$

herein $u_{n_\nu - k} \to \gamma$ so that $1 \le (r_0 + \cdots + r_N)\gamma + \epsilon$ and hence
$\mu\gamma \ge 1$. However, by definition, $\gamma \le \lambda$. Therefore $\gamma = \lambda = 1/\mu$, as
was to be proved.

There remains the case where $f_1 = 0$. Consider then the collection
of all integers $j$ for which $f_j > 0$. Among them we can find a finite
collection $a, b, c, \ldots, m$ whose greatest common divisor is 1. We know
that, when $u_{n_\nu} \to \lambda$, also $u_{n_\nu - xa} \to \lambda$, $u_{n_\nu - yb} \to \lambda$, etc., for every fixed
$x > 0$, $y > 0$, $\ldots$, $w > 0$. Hence also $u_{n_\nu - xa - yb - \cdots - wm} \to \lambda$. In other
words, if an integer $k$ is of the form $k = xa + yb + \cdots + wm$ with
positive integers $x, y, \ldots, w$, then $u_{n_\nu - k} \to \lambda$. Now it is known from
elementary number theory that every integer $k$ exceeding the product
$abc \cdots m$ can be written in this form. This means that for $k >
abc \cdots m$ we have $u_{n_\nu - k} \to \lambda$. To get the inequality (7.8) it suffices
now to apply (7.4) to $n = n_\nu + ab \cdots m$. The remaining part of the
proof requires no change.
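The statement $u_n \to 1/\mu$ and the identity (7.4) are easy to watch numerically. The following is a minimal sketch; the function names and the particular distribution ($f_2 = f_3 = 1/2$, chosen so that $f_1 = 0$ while the greatest common divisor is still one) are illustrative assumptions, not taken from the text.

```python
def renewal_sequence(f: list[float], n_max: int) -> list[float]:
    """u_0 = 1 and u_n = f_1 u_{n-1} + ... + f_n u_0, as in (7.1).

    f[k] holds f_k, with f[0] = 0; f_k is taken as zero beyond the end of the list.
    """
    u = [1.0]
    for n in range(1, n_max + 1):
        top = min(n, len(f) - 1)
        u.append(sum(f[k] * u[n - k] for k in range(1, top + 1)))
    return u


def tail_sums(f: list[float]) -> list[float]:
    """r_n = f_{n+1} + f_{n+2} + ..., as in (7.2), for n = 0, ..., len(f) - 1."""
    return [sum(f[n + 1:]) for n in range(len(f))]


def identity_74(f: list[float], u: list[float], n: int) -> float:
    """Left side of (7.4): r_0 u_n + r_1 u_{n-1} + ... + r_n u_0."""
    r = tail_sums(f)
    return sum(r[k] * u[n - k] for k in range(min(n, len(r) - 1) + 1))


if __name__ == "__main__":
    f = [0.0, 0.0, 0.5, 0.5]                     # hypothetical example: f_2 = f_3 = 1/2
    mu = sum(k * fk for k, fk in enumerate(f))   # mean recurrence time, here 2.5
    u = renewal_sequence(f, 60)
    print(u[:8])
    print(u[60], 1.0 / mu)                       # u_n approaches 1/mu
    print(identity_74(f, u, 37))                 # equals 1 up to rounding
```

In this example the convergence is geometric; for distributions with heavier tails it can be much slower.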

8. Problems for Solution 

1. Suppose that F(s) is a polynomial. Prove for this case all theorems of sec¬ 
tion 3, using the partial fraction method of Chapter 11, section 7. 

2. Let $r$ coins be tossed repeatedly and let $\mathcal{E}$ be the recurrent event that for
each of the $r$ coins the accumulated numbers of heads and tails are equal. Is $\mathcal{E}$
certain or uncertain? For the smallest $r$ for which $\mathcal{E}$ is uncertain, estimate the
probability that $\mathcal{E}$ ever occurs.

3. Suppose that Peter and Paul toss a coin for unit stakes and let $S_n$ be Peter’s
accumulated gain at the conclusion of the $n$th trial. Prove the following

Theorem. The probability that $S_n$ becomes negative for the first time at the trial
number $n = 2k + 1$ equals the probability $f_{2k}$ that $S_n = 0$ for the first time at the trial
number $2k$.

Hint: Use generating functions, in particular (3.18).

4. The following theorem accentuates the surprising features of the arc sine
law.

In the coin-tossing game of section 5 let $Z_n$ be the fraction of time during which
Peter leads. Suppose that at the $2n$th trial the accumulated numbers of heads
and tails are equal. Under this hypothesis the conditional probability that $Z_n = r/n$
equals $1/(n + 1)$, that is, all possible fractions are equally probable.

Hint: Follow the proof of theorem 1. The present proof is actually simpler.
(Cf. the first paper quoted in footnote 4.)

5. Derive theorem 1 of section 5 from the preceding problem.

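6. Let $\mathcal{E}$ be a certain aperiodic recurrent event. Assume that the recurrence
time has finite mean $\mu$ and variance $\sigma^2$. Put $q_n = f_{n+1} + f_{n+2} + \cdots$ and
$r_n = q_{n+1} + q_{n+2} + \cdots$. Show that the generating functions $Q(s)$ and $R(s)$ converge
for $s = 1$. Prove that

(8.1)  $\sum_{n=0}^{\infty} \left( u_n - \dfrac{1}{\mu} \right) s^n = \dfrac{R(s)}{\mu Q(s)}$

and hence that

(8.2)  $\sum_{n=0}^{\infty} \left( u_n - \dfrac{1}{\mu} \right) = \dfrac{\sigma^2 - \mu + \mu^2}{2\mu^2}.$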


7. Let $\mathcal{E}$ be a certain recurrent event and $N_r$ the number of occurrences of $\mathcal{E}$
in $r$ trials. Prove that $E(N_r) = u_1 + \cdots + u_r$ and hence

(8.3)  $E(N_r) \sim \dfrac{r}{\mu}.$

8. Continuation. Prove that

        $E(N_r^2) = u_1 + \cdots + u_r + 2 \sum_{j=1}^{r-1} u_j (u_1 + \cdots + u_{r-j})$

and hence that $E(N_r^2)$ is the coefficient of $s^r$ in

(8.4)  $\dfrac{F^2(s) + F(s)}{(1 - s)\{1 - F(s)\}^2}.$

9. Let $q_{k,n} = \Pr\{N_k = n\}$. Show that $q_{k,n}$ is the coefficient of $s^k$ in

(8.5)  $F^n(s)\, \dfrac{1 - F(s)}{1 - s}.$

Deduce that $E(N_r)$ and $E(N_r^2)$ are the coefficients of $s^r$ in

(8.6)  $\dfrac{F(s)}{(1 - s)\{1 - F(s)\}}$

and (8.4), respectively.



10. Using the notations of problem 6, show that

(8.7)  $\dfrac{F(s)}{(1 - s)\{1 - F(s)\}} = -\dfrac{1}{1 - s} + \dfrac{1}{\mu (1 - s)^2} + \dfrac{R(s)}{\mu \{1 - F(s)\}}.$

Hence, using the last problem, conclude that

(8.8)  $E(N_r) = \dfrac{r + 1}{\mu} + \dfrac{\sigma^2 - \mu - \mu^2}{2\mu^2} + \epsilon_r$

with $\epsilon_r \to 0$.

11. Continuation. Using a similar argument, show that

(8.9)  $E(N_r^2) = \dfrac{(r + 2)(r + 1)}{\mu^2} + \dfrac{2\sigma^2 - 2\mu - \mu^2}{\mu^3}\, r + a_r,$

where $a_r$ remains bounded. Hence

(8.10)  $\mathrm{Var}(N_r) \sim \dfrac{\sigma^2}{\mu^3}\, r.$



* CHAPTER 13 


RECURRENT EVENTS: 

APPLICATIONS TO 
RUNS AND RENEWAL THEORY 

The power of the method of recurrent events is well illustrated by 
the theory of success runs in Bernoulli trials. This is a classical topic 
which has found applications in modern statistics. The new approach 
simplifies the theory and (cf. section 2) applies also to more general 
recurrent patterns which present special difficulties to older methods. 
This unification is the main advantage of the methods developed in 
Chapter 12. 

Section 3 illustrates the method of partial fractions by a detailed 
numerical analysis of an asymptotic formula derived in section 1. 

The remainder of the chapter is devoted to the renewal theory, a 
topic to which an extraordinary number of papers has been devoted. 
The main theorems are derived in section 4 as simple corollaries to 
theorems on recurrent events. Applications of renewal theory are 
described in section 5. 

1. Success Runs 

The term “success run of length $r$” has been defined in several ways
[cf. also example (1.b) in Chapter 12]. It is largely a matter of convention
and convenience whether a sequence of three consecutive
successes is said to contain 0, 1, or 2 runs of length 2, and for different
purposes different definitions have been adopted. However, if we are
to use the theory of recurrent events, then the notion of runs of length
$r$ must be defined so that we start from scratch every time a run
is completed. This means adopting the following definition. A
sequence of $n$ letters $S$ and $F$ contains as many runs of length $r$ as there
are non-overlapping uninterrupted successions of exactly $r$ letters $S$. In a
sequence of Bernoulli trials a run of length $r$ occurs at the $n$th trial if the
$n$th trial adds a new run to the sequence. Thus in $SSS \mid SF \mid SSS \mid SSS$
we have three runs of length 3, and they occur at trials number 3, 8, 11;
there are five runs of length 2, and they occur at trials number 2, 4, 7,

* Starred chapters treat special topics and may be omitted at first reading. 



9, 11. This definition has the advantage of a very considerable simplification
of the theory. It amounts to counting successions of at
least $r$ consecutive successes except that $2r$ consecutive successes count
twice, etc.
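The counting convention just described can be sketched in a few lines of Python. This is only an illustration of the definition; the function name `run_completions` and the encoding of a sequence as a string of the letters S and F are assumptions made here, not notation from the text.

```python
def run_completions(outcomes: str, r: int) -> list[int]:
    """Return the 1-based trial numbers at which a success run of length r occurs.

    The streak counter is reset whenever a run is completed, so that
    counting starts from scratch, as the definition in the text requires.
    """
    completions = []
    streak = 0
    for trial, letter in enumerate(outcomes, start=1):
        if letter == "S":
            streak += 1
            if streak == r:
                completions.append(trial)
                streak = 0  # a run is complete: start from scratch
        else:
            streak = 0  # a failure interrupts the current succession
    return completions


# The sequence SSS | SF | SSS | SSS used as the example in the text.
sequence = "SSSSFSSSSSS"
print(run_completions(sequence, 3))
print(run_completions(sequence, 2))
```

For the sample sequence the function reports the trials 3, 8, 11 for runs of length 3 and the trials 2, 4, 7, 9, 11 for runs of length 2, in agreement with the counts stated above.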

In the sequel $r$ is a fixed integer and $\mathcal{E}$ means the occurrence of a
success run of length $r$. We shall suppose that the trials are Bernoulli
trials, and $\mathcal{E}$ is then a recurrent event. As before, $u_n$ denotes the probability
of $\mathcal{E}$ at the $n$th trial ($u_0 = 1$), and $f_n$ the probability that the first
run of length $r$ occurs at the $n$th trial, so that $\{f_n\}$ is the distribution of
the recurrence times of $\mathcal{E}$.

The probability that the $r$ trials number $n, n-1, n-2, \ldots, n-r+1$
result in success is obviously $p^r$. In this case $\mathcal{E}$ occurs at
one among these $r$ trials; the probability that $\mathcal{E}$ occurs at the trial
number $n-k$ ($k = 0, 1, \ldots, r-1$) and that the following $k$ trials
result in success is $u_{n-k}p^k$. Since these $r$ possibilities are mutually
exclusive, we get the following recurrence relation:¹

(1.1)    $u_n + u_{n-1}p + \cdots + u_{n-r+1}p^{r-1} = p^r.$

This equation is valid for $n \ge r$. Clearly

(1.2)    $u_1 = u_2 = \cdots = u_{r-1} = 0, \qquad u_0 = 1.$

Now multiply (1.1) by $s^n$ and sum over $n = r, r+1, r+2, \ldots$. In
view of (1.2) we get on the left side

(1.3)    $\{U(s) - 1\}(1 + ps + p^2s^2 + \cdots + p^{r-1}s^{r-1})$

and on the right side $p^r(s^r + s^{r+1} + \cdots)$. Summing the two geometric
series, we find

(1.4)    $\{U(s) - 1\}\,\dfrac{1 - (ps)^r}{1 - ps} = \dfrac{p^rs^r}{1 - s}$

or

(1.5)    $U(s) = \dfrac{1 - s + qp^rs^{r+1}}{(1 - s)(1 - p^rs^r)}.$

Using (3.5) of Chapter 12, we get for the generating function of the
recurrence times

(1.6)    $F(s) = \dfrac{p^rs^r(1 - ps)}{1 - s + qp^rs^{r+1}}.$

¹ The classical approach consists in deriving a recurrence relation for $f_n$. This
method is more complicated and does not apply to, say, runs of either kind or
patterns like $SSFFSS$, to which our method applies without change [cf. example
(2.c)]. Our method depends on the modified definition of runs which makes them
recurrent events.





It is seen that $F(1) = 1$ so that runs of length $r$ are certain recurrent
events. Clearing (1.6) of the denominator and differentiating implicitly,
it is easy to calculate $F'(1)$ and $F''(1)$. We find that the mean
and variance of the recurrence times of success runs of length $r$ are

(1.7)    $\mu = \dfrac{1 - p^r}{qp^r}, \qquad \sigma^2 = \dfrac{1}{(qp^r)^2} - \dfrac{2r+1}{qp^r} - \dfrac{p}{q^2},$

respectively. The theorem of section 4 of Chapter 12 implies that for
large $n$ the number $N_n$ of runs of length $r$ produced in $n$ trials is approximately
normally distributed, that is, for fixed $\alpha < \beta$ the probability that

(1.8)    $\dfrac{n}{\mu} + \dfrac{\alpha\sigma n^{1/2}}{\mu^{3/2}} < N_n < \dfrac{n}{\mu} + \dfrac{\beta\sigma n^{1/2}}{\mu^{3/2}}$

tends to $\Phi(\beta) - \Phi(\alpha)$. This fact was first proved by von Mises, but
without the theory of Chapter 12 the proof requires rather lengthy
calculations. Table 1 gives a few typical means of recurrence times.
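The moments (1.7) can be checked numerically without any differentiation. The sketch below computes $u_n$ from the recurrence (1.1) with the initial values (1.2), recovers $f_n$ from the convolution relation $u_n = f_1u_{n-1} + \cdots + f_nu_0$ (the coefficient form of $U(s) = 1/\{1 - F(s)\}$), and compares truncated sums with the closed formulas. The function names, the truncation point 300, and the choice $p = 1/2$, $r = 2$ are illustrative assumptions.

```python
def run_probabilities(p: float, r: int, n_max: int):
    """Return (u, f) for success runs of length r, indices 0..n_max."""
    u = [1.0] + [0.0] * n_max  # u_0 = 1 and u_1 = ... = u_{r-1} = 0, as in (1.2)
    for n in range(r, n_max + 1):
        # (1.1) solved for u_n
        u[n] = p**r - sum(u[n - k] * p**k for k in range(1, r))
    f = [0.0] * (n_max + 1)
    for n in range(1, n_max + 1):
        # u_n = f_1 u_{n-1} + ... + f_n u_0, solved for f_n
        f[n] = u[n] - sum(f[k] * u[n - k] for k in range(1, n))
    return u, f


def moments_from_formula(p: float, r: int):
    """Mean and variance of the recurrence time according to (1.7)."""
    q = 1.0 - p
    mean = (1.0 - p**r) / (q * p**r)
    variance = 1.0 / (q * p**r) ** 2 - (2 * r + 1) / (q * p**r) - p / q**2
    return mean, variance


p, r = 0.5, 2
_, f = run_probabilities(p, r, 300)
mean = sum(n * f[n] for n in range(len(f)))
variance = sum(n * n * f[n] for n in range(len(f))) - mean**2
print("numerical:", mean, variance)
print("formula:  ", moments_from_formula(p, r))
```

The truncation is harmless here because $f_n$ decreases geometrically; for very small $p^r$ a longer range would be needed.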


TABLE 1

Mean Recurrence Times for Success Runs if Trials Are Performed at
the Rate of One per Second

Length of Run    p = 0.6         p = 0.5 (Coins)    p = 1/6 (Dice)
r =  5           30.7 seconds    1 minute           2.6 hours
    10           6.9 minutes     34.1 minutes       28.0 months
    15           1.5 hours       18.2 hours         18,098 years
    20           19 hours        24.3 days          140.7 million years


The method of partial fractions (Chapter 11, section 7) permits us
to derive an excellent approximation to the probabilities $f_n$ that the
first run occurs at the $n$th trial. The denominator in (1.6) can be
factored:

(1.9)    $1 - s + qp^rs^{r+1} = (1 - ps)\{1 - qs(1 + ps + \cdots + (ps)^{r-1})\}.$

The first factor has the root $s = 1/p$, which is also a root of the numerator
and should be cancelled. We have, therefore, only to consider
the roots of the second factor, that is,

(1.10)    $qs(1 + ps + \cdots + p^{r-1}s^{r-1}) = 1.$




Obviously this equation has a unique positive root $s = x$. For every
real or imaginary number $s$ with $|s| \le x$ we have

(1.11)    $|qs(1 + ps + \cdots + p^{r-1}s^{r-1})| \le qx(1 + px + \cdots + p^{r-1}x^{r-1}) = 1,$

where the equality sign is possible only if all terms on the left have
the same argument, that is, if $s = x$. Hence $x$ is smaller in absolute
value than any other root of the denominator in (1.6). We can,
therefore, apply formula (7.9) of Chapter 11 with $s_1 = x$. The coefficient
$\rho_1$ is easily computed from (7.5) with $U(s) = p^rs^r(1 - ps)$
and $V(s) = 1 - s + qp^rs^{r+1}$. We find, using that $V(x) = 0$,

(1.12)    $f_n \sim \dfrac{(x - 1)(1 - px)}{(r + 1 - rx)q}\cdot\dfrac{1}{x^{n+1}}.$

The probability that $n$ trials result in no run is $q_n = f_{n+1} + f_{n+2} + f_{n+3} + \cdots$.
We get for it from (1.12), summing a geometric
series,²

(1.13)    $q_n \sim \dfrac{1 - px}{(r + 1 - rx)q}\cdot\dfrac{1}{x^{n+1}}.$

We have thus found that, if $x$ is the unique positive root of (1.10), the
probability that $n$ trials produce no success run of length $r$ is, asymptotically,
given by (1.13). Table 2 shows that the formula gives surprisingly
good approximations even for very small $n$, and the goodness
of approximation increases rapidly with $n$. We have here a typical
example of the power of the method of generating functions combined
with the method of partial fractions. In section 3 we shall see that the
calculation of $x$ is rather easy and that estimates of the error in (1.13)
can be obtained.
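The quality of the approximation (1.13) is easy to examine numerically. The following sketch, given under stated assumptions, finds the positive root $x$ of (1.10) by bisection on the interval from 1 to $1/p$ (which presupposes $p < r/(r+1)$, the case in which $x < 1/p$) and compares (1.13) with the exact probability obtained by a direct recursion over the length of the current succession of successes. All function names are illustrative.

```python
def positive_root(p: float, r: int) -> float:
    """Positive root x of q s (1 + ps + ... + (ps)^(r-1)) = 1, by bisection.

    Assumes p < r/(r+1), so that the root lies between 1 and 1/p.
    """
    q = 1.0 - p

    def g(s: float) -> float:
        return q * s * sum((p * s) ** j for j in range(r)) - 1.0

    lo, hi = 1.0, 1.0 / p
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def approximation(p: float, r: int, n: int) -> float:
    """Right side of (1.13)."""
    q = 1.0 - p
    x = positive_root(p, r)
    return (1.0 - p * x) / ((r + 1 - r * x) * q) / x ** (n + 1)


def exact_no_run(p: float, r: int, n: int) -> float:
    """Exact probability that n trials contain no success run of length r."""
    q = 1.0 - p
    state = [1.0] + [0.0] * (r - 1)  # state[j]: current succession has j successes
    for _ in range(n):
        new = [0.0] * r
        new[0] = q * sum(state)        # a failure resets the succession
        for j in range(r - 1):
            new[j + 1] = p * state[j]  # a success lengthens it
        state = new                    # sequences completing a run are dropped
    return sum(state)


for n in range(2, 6):
    print(n, exact_no_run(0.5, 2, n), round(approximation(0.5, 2, n), 5))
```

For $r = 2$, $p = 1/2$ the root is $x = \sqrt{5} - 1$, and the two columns printed correspond to the exact and approximate values of $q_n$.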


TABLE 2

Probability of Having No Success Run of Length r = 2 in n Trials with p = 1/2

                     Approximation                Bound for Error
 n     q_n Exact        (1.13)        Error     According to (3.10)
 2     0.75            0.76631        0.0163         0.0835
 3      .625            .61996         .0050          .042
 4      .500            .50156         .0016          .021
 5      .40625          .40577         .0005          .010

² A special case of (1.13) is equation (7.13) of Chapter 11,
example (7.a).





2. More General Patterns 

Our method is applicable to more general problems which have
been considered as considerably deeper than the theory of runs.
Two special examples of recurrent patterns were treated in Chapter 11,
examples (7.a) and (7.b). Here we treat two more interesting problems
whose solution is surprisingly simple.

Examples. (a) Runs of Either Kind. Let $\mathcal{E}$ stand for “either a
success run of length $r$ or a failure run of length $\rho$.” We are dealing with
two recurrent events $\mathcal{E}_1$ and $\mathcal{E}_2$, where $\mathcal{E}_1$ stands for “success run of
length $r$” and $\mathcal{E}_2$ for “failure run of length $\rho$,” and $\mathcal{E}$ means “either $\mathcal{E}_1$
or $\mathcal{E}_2$.” To $\mathcal{E}_1$ there corresponds the generating function (1.5) which
will now be denoted by $U_1(s)$. The corresponding generating function
$U_2(s)$ for $\mathcal{E}_2$ is obtained from (1.5) by interchanging $p$ and $q$, and
replacing $r$ by $\rho$. Now the probability $u_n$ that $\mathcal{E}$ occurs at the $n$th
trial is the sum of the corresponding probabilities for $\mathcal{E}_1$ and $\mathcal{E}_2$. An
exception occurs for $n = 0$, where $u_0 = 1$. It follows that

(2.1)    $U(s) = U_1(s) + U_2(s) - 1.$

The generating function $F(s)$ of the recurrence times of $\mathcal{E}$ is again
obtained from (3.5) of Chapter 12. We have $F(s) = 1 - U^{-1}(s)$ or

(2.2)    $F(s) = \dfrac{(1 - ps)p^rs^r(1 - q^\rho s^\rho) + (1 - qs)q^\rho s^\rho(1 - p^rs^r)}{1 - s + qp^rs^{r+1} + pq^\rho s^{\rho+1} - p^rq^\rho s^{r+\rho}}.$

The mean recurrence time follows by differentiation:

(2.3)    $\mu = \dfrac{(1 - p^r)(1 - q^\rho)}{qp^r + pq^\rho - p^rq^\rho}.$

As $\rho \to \infty$, this expression tends to the mean recurrence time of success
runs as given in (1.7).
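A short numerical sketch may help in reading (2.3). It evaluates the mean recurrence time for runs of either kind, confirms that its reciprocal is the sum of the reciprocals of the two individual means (which is what (2.1) suggests, since $u_n$ tends to the reciprocal of the mean), and illustrates the limit $\rho \to \infty$. The function names and the sample values are assumptions chosen for illustration.

```python
def mean_success_runs(p: float, r: int) -> float:
    """Mean recurrence time of success runs of length r, from (1.7)."""
    q = 1.0 - p
    return (1.0 - p**r) / (q * p**r)


def mean_either_kind(p: float, r: int, rho: int) -> float:
    """Mean recurrence time (2.3) for a success run of length r
    or a failure run of length rho."""
    q = 1.0 - p
    return (1.0 - p**r) * (1.0 - q**rho) / (q * p**r + p * q**rho - p**r * q**rho)


p, r, rho = 0.6, 3, 2
q = 1.0 - p
mu = mean_either_kind(p, r, rho)
mu_1 = mean_success_runs(p, r)    # success runs of length r
mu_2 = mean_success_runs(q, rho)  # failure runs: interchange p and q
print(1.0 / mu, 1.0 / mu_1 + 1.0 / mu_2)

# As rho grows, failure runs of length rho become negligible.
print(mean_either_kind(0.5, 2, 200), mean_success_runs(0.5, 2))
```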

(b) In Chapter 8, section 1, we calculated the probability $x$ that a
success run of length $r$ occurs before a failure run of length $\rho$. Suppose
now that this event is made the object of a bet and that we are interested
in the probability distribution of the duration of the game. More
precisely, we define two recurrent events $\mathcal{E}_1$ and $\mathcal{E}_2$ as in example (a).
Let $x_n$ = probability that $\mathcal{E}_1$ occurs for the first time at the $n$th trial
and no $\mathcal{E}_2$ precedes it; $f_n$ = probability that $\mathcal{E}_1$ occurs for the first time
at the $n$th trial (with no condition on $\mathcal{E}_2$). Define $y_n$ and $g_n$ as $x_n$
and $f_n$, respectively, but with $\mathcal{E}_1$ and $\mathcal{E}_2$ interchanged.

The generating function for $f_n$ is given in (1.6), and $G(s)$ is obtained
by interchanging $p$ and $q$, and replacing $r$ by $\rho$. For $x_n$ and $y_n$ we have
the obvious recurrence relations





(2.4)    $x_n = f_n - (y_1f_{n-1} + y_2f_{n-2} + \cdots + y_{n-1}f_1)$

         $y_n = g_n - (x_1g_{n-1} + x_2g_{n-2} + \cdots + x_{n-1}g_1).$

These equations are of the convolution type, and for the corresponding
generating functions we have, therefore,

(2.5)    $X(s) = F(s) - Y(s)F(s)$

         $Y(s) = G(s) - X(s)G(s).$

These are two linear equations in the unknowns $X(s)$ and $Y(s)$. We
get

(2.6)    $X(s) = \dfrac{F(s)\{1 - G(s)\}}{1 - F(s)G(s)}, \qquad Y(s) = \dfrac{G(s)\{1 - F(s)\}}{1 - F(s)G(s)}.$

Expressions for $x_n$ and $y_n$ can again be obtained by the method of
partial fractions. For $s = 1$ we get $X(1) = \sum x_n = x$, the probability
of $\mathcal{E}_1$ occurring before $\mathcal{E}_2$. Now both numerator and denominator
vanish for $s = 1$, and $X(1)$ is obtained from L'Hospital's rule, differentiating
numerator and denominator: $X(1) = G'(1)/\{F'(1) + G'(1)\}$.
Using the values $F'(1) = (1 - p^r)/qp^r$ and $G'(1) = (1 - q^\rho)/pq^\rho$
from (1.7), we find $X(1)$ as given in equation (1.3) of Chapter 8.
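The recurrence relations (2.4) lend themselves to direct computation. The sketch below is an illustration only: it obtains $f_n$ and $g_n$ from the recurrence (1.1) and the convolution relation between $\{u_n\}$ and $\{f_n\}$, then applies (2.4) step by step and compares $\sum x_n$ with $G'(1)/\{F'(1) + G'(1)\}$. The function names, the truncation at 300 trials, and the values $p = 1/2$, $r = 2$, $\rho = 3$ are assumptions.

```python
def first_run_distribution(p: float, r: int, n_max: int) -> list[float]:
    """f_0..f_{n_max} for the first success run of length r (success probability p)."""
    u = [1.0] + [0.0] * n_max
    for n in range(r, n_max + 1):
        u[n] = p**r - sum(u[n - k] * p**k for k in range(1, r))       # (1.1)
    f = [0.0] * (n_max + 1)
    for n in range(1, n_max + 1):
        f[n] = u[n] - sum(f[k] * u[n - k] for k in range(1, n))       # U = 1/(1 - F)
    return f


def duration_distributions(p: float, r: int, rho: int, n_max: int):
    """x_n and y_n of example (b), computed from the recurrences (2.4)."""
    f = first_run_distribution(p, r, n_max)
    g = first_run_distribution(1.0 - p, rho, n_max)  # interchange p and q, r and rho
    x = [0.0] * (n_max + 1)
    y = [0.0] * (n_max + 1)
    for n in range(1, n_max + 1):
        x[n] = f[n] - sum(y[k] * f[n - k] for k in range(1, n))
        y[n] = g[n] - sum(x[k] * g[n - k] for k in range(1, n))
    return x, y


p, r, rho = 0.5, 2, 3
q = 1.0 - p
x, y = duration_distributions(p, r, rho, 300)
f_prime = (1.0 - p**r) / (q * p**r)        # F'(1)
g_prime = (1.0 - q**rho) / (p * q**rho)    # G'(1)
print(sum(x), g_prime / (f_prime + g_prime))
print(sum(x) + sum(y))  # the game ends with probability one
```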

(c) Consider the recurrent event defined by the pattern $SSFFSS$.
Repeating the argument of section 1, we find easily that

(2.7)    $p^4q^2 = u_n + p^2q^2u_{n-4} + p^3q^2u_{n-5}.$

From this relation we get the generating function and from it the
mean recurrence time $\mu = p^{-4}q^{-2} + p^{-2} + p^{-1}$. For $p = q = 1/2$ we find
$\mu = 70$, whereas the mean recurrence time for a head run of length 6
in coin tossing is 126. This shows that there is an essential difference
between head runs and other patterns of the same length.
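The mean recurrence time of a pattern can be checked independently of the generating function. The sketch below follows the number of letters of the pattern matched so far (a standard string-matching automaton, used here as an assumed device that does not appear in the text) and accumulates the probability that the pattern is completed for the first time at trial $n$. The function name and the truncation at 20,000 trials are illustrative choices.

```python
def mean_waiting_time(pattern: str, p: float, n_max: int = 20000) -> float:
    """Mean number of trials up to the first completion of `pattern`.

    `pattern` is a string of the letters S and F; p is the probability of S.
    The sum over n is truncated at n_max, which is ample for short patterns.
    """
    prob = {"S": p, "F": 1.0 - p}
    m = len(pattern)

    def next_state(j: int, letter: str) -> int:
        # Length of the longest beginning of the pattern that ends the
        # text formed by the j matched letters followed by `letter`.
        text = pattern[:j] + letter
        for length in range(len(text), 0, -1):
            if text.endswith(pattern[:length]):
                return length
        return 0

    trans = [{c: next_state(j, c) for c in "SF"} for j in range(m)]
    state = [1.0] + [0.0] * (m - 1)  # state[j]: j letters matched, pattern not yet seen
    mean = 0.0
    for n in range(1, n_max + 1):
        new = [0.0] * m
        completed = 0.0
        for j, mass in enumerate(state):
            for c in "SF":
                k = trans[j][c]
                if k == m:
                    completed += mass * prob[c]
                else:
                    new[k] += mass * prob[c]
        mean += n * completed
        state = new
    return mean


print(mean_waiting_time("SSFFSS", 0.5))
print(mean_waiting_time("SSSSSS", 0.5))
```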


3. Numerical Estimates³

In this section we propose to show that the positive root x of the 
algebraic equation (1.10) can be calculated very easily; in addition, 
we shall derive estimates of the error involved in the asymptotic 
expansions (1.12) and (1.13). 

We remember that equation (1.10) has been obtained from (1.9), so
that all roots of (1.10) are also roots of the equation

(3.1)    $1 - s + qp^rs^{r+1} = 0.$

³ This section contains a simplification and slight improvement of the results in
Chapter 5 of Uspensky's book, Introduction to Mathematical Probability, New York,
1937. The results will not be used in the sequel.





This equation has also the extraneous root $s = 1/p$, but $x$ is the
only other positive root (unless $s = 1/p$ is a double root, in which
case $x = 1/p$).

For abbreviation we put $1 + qp^rs^{r+1} = f(s)$, so that (3.1) becomes
$f(s) = s$. We know that the graph of $y = f(s)$ crosses the bisector
$y = s$ at $s = x$ and at $s = 1/p$. Between these two roots the graph
of $f(s)$ lies under the bisector. At the smaller root we must, therefore,
have $f'(s) < 1$, while at the larger root $f'(s) > 1$. Now $f'(1/p) = (r + 1)q$.
For definiteness we shall assume that

(3.2)    $q > \dfrac{1}{r + 1}$, that is, $p < \dfrac{r}{r + 1}.$

In this case $x < 1/p$.

We first show that the root $x$ can be found by a simple iteration
process. Note that, if $s < 1/p$ and $f(s) > s$, then necessarily $x > s$,
so that such a value of $s$ serves as a lower bound for $x$. However,
since $f(s)$ is increasing, the inequality $s < x$ implies $f(s) < f(x) = x$.
On the other hand, $f(s) > s$, so that the value $f(s)$ is a better approximation
to $x$ than $s$. In this way we can improve any given approximation.
For example, start with the approximation $s_0 = 1$. Then
the value $s_1 = f(s_0)$, or

(3.3)    $s_1 = 1 + qp^r,$

is a better approximation, and certainly $s_1 < x$. Continuing in this
way, we get a sequence $s_k$ defined by $s_{k+1} = f(s_k)$; clearly the $s_k$
increase monotonically, and their limit is a root of the equation $s = f(s)$.
Since $s_k < x$ for all $k$ and $x$ is the smallest root, we see that the root $x$
can be obtained as the limit of the sequence $s_k$.
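The iteration process is short enough to write out. The following is a minimal sketch, assuming $p < r/(r+1)$ as in (3.2); the function name, the stopping tolerance, and the iteration limit are choices made here for illustration.

```python
def smallest_root(p: float, r: int, tol: float = 1e-15, max_iter: int = 10_000) -> float:
    """Limit of the sequence s_0 = 1, s_{k+1} = f(s_k) with f(s) = 1 + q p^r s^(r+1)."""
    q = 1.0 - p
    s = 1.0
    for _ in range(max_iter):
        nxt = 1.0 + q * p**r * s ** (r + 1)
        if nxt - s <= tol:  # the increasing sequence has stopped moving
            return nxt
        s = nxt
    return s


print(smallest_root(0.5, 2))   # for r = 2, p = 1/2 the root is sqrt(5) - 1
print(smallest_root(0.5, 10))
```

The convergence is geometric with ratio $f'(x) < 1$, so that for long runs, where $qp^r$ is small, a handful of iterations suffices.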

In practice the approximation (3.3) is usually sufficient. In fact,
the second approximation $s_2$ differs from $s_1$ only by terms of the order
of magnitude $rq^2p^{2r}$, which are usually very small. For an estimate
of the error involved in (3.3) it suffices to find an upper bound for $x$,
that is, any number $s$ such that $f(s) < s$. Now put

(3.4)    $s = 1 + qp^r(1 + \epsilon).$

Since $1 + t < e^t$ for all positive $t$, we find that

(3.5)    $f(s) = 1 + qp^rs^{r+1} < 1 + qp^re^{(r+1)qp^r(1+\epsilon)}.$

In order for (3.4) to give an upper bound to $x$ it suffices therefore that
the right side of (3.5) be smaller than the right side of (3.4), or that

(3.6)    $(r + 1)qp^r(1 + \epsilon) < \log(1 + \epsilon).$





This inequality is certainly satisfied if we choose $\epsilon > 0$ so that the
left side becomes $\epsilon/2 < \log(1 + \epsilon)$. Simple arithmetic shows that (3.6)
holds if we put

(3.7)    $\epsilon = \dfrac{2(r + 1)qp^r}{1 - 2(r + 1)qp^r}.$

This quantity is positive, since for fixed $r$ the maximum value of $qp^r$
is attained for $p = r/(r + 1)$ so that $(r + 1)qp^r < 1/2$. The true
value of the root $x$ lies between the approximations (3.3) and (3.4) with
$\epsilon$ given by (3.7). For $p = 0.5$, $r = 10$ we have $s_1 = 1 + 2^{-11} = 1.000488$,
while (3.4) leads to the upper bound 1.000493. In general,
the approximation $s_1$ is better than the rough estimate (3.4).
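The two bounds for $x$ can be evaluated directly. The sketch below computes the lower bound (3.3) and the upper bound (3.4) with $\epsilon$ from (3.7), and checks the defining inequalities $f(s) > s$ and $f(s) < s$ for them; the function names are illustrative assumptions.

```python
def f(s: float, p: float, r: int) -> float:
    """f(s) = 1 + q p^r s^(r+1), the function whose fixed points solve (3.1)."""
    return 1.0 + (1.0 - p) * p**r * s ** (r + 1)


def root_bounds(p: float, r: int):
    """Lower bound (3.3) and upper bound (3.4), (3.7) for the root x."""
    c = (1.0 - p) * p**r
    lower = 1.0 + c                                           # (3.3)
    eps = 2.0 * (r + 1) * c / (1.0 - 2.0 * (r + 1) * c)       # (3.7)
    upper = 1.0 + c * (1.0 + eps)                             # (3.4)
    return lower, upper


lower, upper = root_bounds(0.5, 10)
print(lower, upper)
print(f(lower, 0.5, 10) > lower, f(upper, 0.5, 10) < upper)
```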

We now turn to an estimate of the error involved in (1.13). We
know from equation (7.8) of Chapter 11 that the exact expression of
$q_n$ is the sum of $r$ terms with each root of (1.10) contributing one term.
The contribution $A_k$ of the root $s_k$ is, of course, given by the right side
of (1.13) when $x$ is replaced by $s_k$. Thus

(3.8)    $A_k = \dfrac{1 - ps_k}{(r + 1 - rs_k)q}\cdot\dfrac{1}{s_k^{n+1}}.$

Now we found that the $r$ roots $s_k$ of (1.10) are also roots of (3.1). Let
$z = |s_k|$ be the absolute value of a given non-positive root $s_k$. Then
$z = |1 + qp^rs_k^{r+1}| < 1 + qp^rz^{r+1}$. Hence $f(z) > z$, so that $z$ cannot lie
between $x$ and $1/p$. Since we know that $z > x$, this means that all roots
of (1.10) except $x$ are larger in absolute value than $1/p$. To estimate the
first fraction in (3.8) we observe that

$\left|\dfrac{1/p - s_k}{(r + 1)/r - s_k}\right|$

is the ratio of the distances of the root $s_k$ to the points $1/p$ and $(r + 1)/r$
of the real axis. In view of (3.2) the point $(r + 1)/r$ lies inside the
circle $|s| = 1/p$, and the point $s_k$ outside. The maximum value of the
ratio for all $s$ with $|s| \ge 1/p$ is attained for $s = -1/p$. Using (3.2),
it follows then that

(3.9)    $|A_k| < \dfrac{2p^{n+1}}{(r + 1 + r/p)q} < \dfrac{2p^{n+2}}{rq(1 + p)}.$

Therefore the error committed in (1.13) is smaller in absolute value than

(3.10)    $\dfrac{2(r - 1)p^{n+2}}{rq(1 + p)}.$

For numerical values see Table 2 of section 1.
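For $r = 2$ the statements of this section can be verified completely, since (1.10) is then the quadratic equation $pqs^2 + qs - 1 = 0$ and both roots are available in closed form. The sketch below, with illustrative function names, checks that the two terms (3.8) add up to the exact probability $q_n$ and that the term belonging to the negative root stays below the bound (3.10).

```python
import math


def terms_for_r2(p: float, n: int) -> list[float]:
    """The contributions A_1, A_2 of (3.8) for r = 2."""
    q = 1.0 - p
    disc = math.sqrt(q * q + 4.0 * p * q)
    roots = ((-q + disc) / (2.0 * p * q), (-q - disc) / (2.0 * p * q))
    return [(1.0 - p * s) / ((3.0 - 2.0 * s) * q) / s ** (n + 1) for s in roots]


def exact_no_run_r2(p: float, n: int) -> float:
    """Exact probability of no two consecutive successes in n trials."""
    q = 1.0 - p
    last_f, last_s = 1.0, 0.0  # probability mass by outcome of the latest trial
    for _ in range(n):
        last_f, last_s = q * (last_f + last_s), p * last_f
    return last_f + last_s


def error_bound(p: float, r: int, n: int) -> float:
    """The bound (3.10) for the error committed in (1.13)."""
    q = 1.0 - p
    return 2.0 * (r - 1) * p ** (n + 2) / (r * q * (1.0 + p))


for n in range(2, 6):
    a1, a2 = terms_for_r2(0.5, n)
    print(n, a1 + a2, exact_no_run_r2(0.5, n), abs(a2), error_bound(0.5, 2, n))
```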





4. The Renewal Equation 

We now proceed to extend the basic results of Chapter 12, section 
3, to a more general case which frequently occurs in applications. 
Examples will be given in the next section; here we proceed in a formal 
manner. 

Suppose there are given two sequences of bounded non-negative numbers
$\{a_n\}$ and $\{b_n\}$ ($n = 0, 1, 2, \ldots$) with $a_0 \ne 1$. A third sequence $\{u_n\}$ is
defined by the recursive relations

(4.1)    $u_n = b_n + (a_0u_n + a_1u_{n-1} + \cdots + a_nu_0).$

In the notations of Chapter 11, section 2, this equation becomes

(4.2)    $\{u_n\} = \{b_n\} + \{a_n\} * \{u_n\}.$

Solving (4.1) successively, we get

$u_0 = b_0/(1 - a_0), \qquad u_1 = (b_1 + a_1u_0)/(1 - a_0),$ etc.,

so that no problem as to the existence of a unique solution $\{u_n\}$ arises.
We are interested in the behavior of $\{u_n\}$ as $n \to \infty$, a problem to
which a great number of papers (mostly of controversial nature) have
been devoted.
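The successive solution of (4.1) translates directly into a loop. In the sketch below the sequences $\{a_n\}$ and $\{b_n\}$ are assumed to have only finitely many non-zero terms and are passed as lists; this restriction, like the function name `solve_renewal`, is an assumption made for the illustration.

```python
def solve_renewal(a: list[float], b: list[float], n_max: int) -> list[float]:
    """Solve u_n = b_n + (a_0 u_n + a_1 u_{n-1} + ... + a_n u_0) for n = 0..n_max.

    Terms of a and b beyond the ends of the lists are taken to be zero;
    a[0] must differ from 1.
    """
    u: list[float] = []
    for n in range(n_max + 1):
        b_n = b[n] if n < len(b) else 0.0
        # a_1 u_{n-1} + ... + a_n u_0; the term a_0 u_n is moved to the left side
        conv = sum(a[k] * u[n - k] for k in range(1, min(n, len(a) - 1) + 1))
        u.append((b_n + conv) / (1.0 - a[0]))
    return u


# Recurrent-event case: b_0 = 1, b_n = 0 otherwise, a_n = f_n.
print(solve_renewal([0.0, 0.5, 0.5], [1.0], 5))
```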

Note that (4.1) reduces to (3.3) of Chapter 12 if we put
$b_n = 0$, $a_n = f_n$ for $n > 0$, and $a_0 = 0$, and $b_0 = 1$. Thus the renewal
equation (4.1) contains the fundamental equation of recurrent events
as special case. However, we can derive all properties of the renewal
equation from the results on recurrent events. Once more we pass to
generating functions. Since the coefficients $a_n$ and $b_n$ are bounded,
$A(s) = \sum a_ns^n$ and $B(s) = \sum b_ns^n$ converge at least for $|s| < 1$. The
domain of convergence of $U(s) = \sum u_ns^n$ remains to be investigated.
From (4.2) we get $U(s) = B(s) + A(s)U(s)$ or

(4.3)    $U(s) = \dfrac{B(s)}{1 - A(s)}.$

For $B(s) \equiv 1$ this reduces to (3.5) of Chapter 12. The essential difference
is that now $\{a_n\}$ is not necessarily the distribution of a recurrence
time, so that $A(s)$ can be larger as well as smaller than 1. We shall
investigate only the case where

(4.4)    $B(1) = \sum b_k$

is finite.




We shall say that we have the periodic case if there exists an integer
$\lambda > 1$, such that all $a_k$ except, perhaps, $a_\lambda, a_{2\lambda}, a_{3\lambda}, \ldots$ vanish. Then
$A(s)$ is a power series in $s^\lambda$. The largest integer $\lambda$ with the said property
is called the period.

Theorem 1. If $\{a_k\}$ is a probability distribution (that is, if $A(1) = 1$),
then, except in the periodic case,

(4.5)    $u_n \to \dfrac{B(1)}{\mu},$

where $\mu = A'(1) = \sum na_n$. In particular, $u_n \to 0$ if $\sum na_n$ diverges.

Proof. Let $v_n$ be the coefficient of $s^n$ in $1/\{1 - A(s)\}$. We know
from theorem 3 of Chapter 12 that $v_n \to 1/\mu$. Now

(4.6)    $u_n = v_nb_0 + v_{n-1}b_1 + \cdots + v_0b_n.$

For every fixed $k$ the term $v_{n-k}b_k$ tends to $b_k/\mu$ as $n \to \infty$. Moreover,
the $v_n$ are bounded. It follows that, for $N$ sufficiently large, $u_n$ differs
arbitrarily little from

(4.7)    $u_n' = v_nb_0 + v_{n-1}b_1 + \cdots + v_{n-N}b_N$

and $u_n' \to (b_0 + \cdots + b_N)/\mu$, which in turn differs arbitrarily little
from $B(1)/\mu$.
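Theorem 1 can be watched at work on a small example. The sketch below solves the renewal equation successively and compares a late term of $\{u_n\}$ with $B(1)/\mu$. The particular sequences ($a_1 = a_2 = 1/2$, $b_0 = 1$, $b_1 = 2$, $b_2 = 3$) and the function name are assumptions chosen only for illustration; the sequence $\{a_n\}$ is non-periodic because $a_1$ and $a_2$ are both positive.

```python
def solve_renewal(a: list[float], b: list[float], n_max: int) -> list[float]:
    """Successive solution of (4.1); missing terms of a and b count as zero."""
    u: list[float] = []
    for n in range(n_max + 1):
        b_n = b[n] if n < len(b) else 0.0
        conv = sum(a[k] * u[n - k] for k in range(1, min(n, len(a) - 1) + 1))
        u.append((b_n + conv) / (1.0 - a[0]))
    return u


a = [0.0, 0.5, 0.5]   # a probability distribution: A(1) = 1
b = [1.0, 2.0, 3.0]   # B(1) = 6
mu = sum(n * a_n for n, a_n in enumerate(a))  # mu = A'(1) = 1.5
u = solve_renewal(a, b, 60)
print(u[:6])
print(u[60], sum(b) / mu)
```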

Theorem 2. If $\sum a_k < 1$, then $u_n \to 0$ so fast that the series
$\sum u_n = B(1)/\{1 - A(1)\}$ converges.

Proof. From (4.3) we have the expansion

$U(s) = B(s)\{1 + A(s) + A^2(s) + A^3(s) + \cdots\}$

valid for all $s$ for which $|A(s)| < 1$ and therefore at least for $|s| \le 1$.
The right side can be rearranged into a power series in $s$, converging
at least for $|s| \le 1$, and this proves the theorem.

Theorem 3. If $\sum a_k > 1$, then there exists a unique positive root $x < 1$
of the equation $A(s) = 1$. Then in the non-periodic case

(4.8)    $u_nx^n \to \dfrac{B(x)}{xA'(x)},$

so that $u_n$ is of the order of magnitude $x^{-n}$ and hence increases exponentially.
The value $A'(x)$ is finite, since $A(s)$ converges for $|s| < 1$
and $x < 1$.

Proof. It suffices to apply theorem 1 to the coefficients of the power
series in

(4.9)    $U(xs) = \dfrac{B(xs)}{1 - A(xs)},$

namely, $\{u_nx^n\}$, $\{b_nx^n\}$, and $\{a_nx^n\}$.
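A familiar special case illustrates theorem 3. With $a_1 = a_2 = 1$ and $b_0 = 1$ (values assumed here purely for illustration) the renewal equation becomes $u_n = u_{n-1} + u_{n-2}$, so that the $u_n$ are the Fibonacci numbers; the root of $A(s) = s + s^2 = 1$ is $x = (\sqrt{5} - 1)/2$, and $u_nx^n$ should approach $B(x)/\{xA'(x)\} = 1/\{x(1 + 2x)\}$.

```python
import math


def solve_renewal(a: list[float], b: list[float], n_max: int) -> list[float]:
    """Successive solution of (4.1); missing terms of a and b count as zero."""
    u: list[float] = []
    for n in range(n_max + 1):
        b_n = b[n] if n < len(b) else 0.0
        conv = sum(a[k] * u[n - k] for k in range(1, min(n, len(a) - 1) + 1))
        u.append((b_n + conv) / (1.0 - a[0]))
    return u


a = [0.0, 1.0, 1.0]            # A(1) = 2 > 1
b = [1.0]
x = (math.sqrt(5.0) - 1.0) / 2.0   # positive root of s + s^2 = 1
u = solve_renewal(a, b, 40)
limit = 1.0 / (x * (1.0 + 2.0 * x))  # B(x) / (x A'(x)) with B = 1, A'(s) = 1 + 2s
print(u[:8])
print(u[40] * x**40, limit)
```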

There remains the periodic case where $A(s) = \sum a_{n\lambda}s^{n\lambda}$ is a power
series in $s^\lambda$. In this case the coefficients $u_n$ exhibit a certain periodicity,
and we divide them into groups of equal phase, $\{u_0, u_\lambda, u_{2\lambda}, u_{3\lambda}, \ldots\}$,
$\{u_1, u_{\lambda+1}, u_{2\lambda+1}, \ldots\}$, $\ldots$, $\{u_{\lambda-1}, u_{2\lambda-1}, u_{3\lambda-1}, \ldots\}$. It is obvious
from (4.3) that the coefficients $u_{n\lambda}$ depend only on $b_0, b_\lambda, b_{2\lambda}, \ldots$
but not on the $b_k$ with $k$ not divisible by $\lambda$. This leads us to represent
$U(s)$ and $B(s)$ as the sum of $\lambda$ power series in $s^\lambda$

(4.10)    $U(s) = U_0(s) + sU_1(s) + \cdots + s^{\lambda-1}U_{\lambda-1}(s)$

          $B(s) = B_0(s) + sB_1(s) + \cdots + s^{\lambda-1}B_{\lambda-1}(s),$

where

(4.11)    $U_j(s) = \sum_{n=0}^{\infty} u_{n\lambda+j}s^{n\lambda}, \qquad B_j(s) = \sum_{n=0}^{\infty} b_{n\lambda+j}s^{n\lambda}.$

Then, from (4.3) for $j = 0, 1, \ldots, \lambda - 1$,

(4.12)    $U_j(s) = \dfrac{B_j(s)}{1 - A(s)}.$

Here all functions are power series in $s^\lambda$, and the preceding theorems
apply after the change of variables $s^\lambda = t$. In theorem 1 we had
$\mu = A'(1)$, the differentiation being with respect to $s$. In the present
case we must of course replace $\mu$ by

(4.13)    $\dfrac{dA(1)}{dt} = \dfrac{\mu}{\lambda}.$

We have thus

Theorem 4. In the periodic case with period $\lambda$ the sequence $\{u_n\}$ is
asymptotically periodic; if $A(1) = 1$, each of the $\lambda$ subsequences $\{u_{n\lambda+j}\}$
has a limit

(4.14)    $\lim_{n \to \infty} u_{n\lambda+j} = \dfrac{\lambda B_j(1)}{\mu},$

where $B_j(1) = b_j + b_{\lambda+j} + b_{2\lambda+j} + b_{3\lambda+j} + \cdots$.
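The periodic case can be illustrated in the same way. In the sketch below the sequence $a_2 = a_4 = 1/2$ has period $\lambda = 2$ and mean $\mu = 3$, and $b_0 = 1$, $b_1 = 2$, so that $B_0(1) = 1$ and $B_1(1) = 2$; by (4.14) the terms with even index should approach $2/3$ and those with odd index $4/3$. The numbers and the function name are assumptions chosen for illustration.

```python
def solve_renewal(a: list[float], b: list[float], n_max: int) -> list[float]:
    """Successive solution of (4.1); missing terms of a and b count as zero."""
    u: list[float] = []
    for n in range(n_max + 1):
        b_n = b[n] if n < len(b) else 0.0
        conv = sum(a[k] * u[n - k] for k in range(1, min(n, len(a) - 1) + 1))
        u.append((b_n + conv) / (1.0 - a[0]))
    return u


a = [0.0, 0.0, 0.5, 0.0, 0.5]   # only a_2 and a_4 differ from zero: period 2
b = [1.0, 2.0]
period = 2
mu = sum(n * a_n for n, a_n in enumerate(a))   # mu = 3
u = solve_renewal(a, b, 101)

for j in range(period):
    b_j = sum(b[j::period])                    # B_j(1) = b_j + b_{period+j} + ...
    print(j, u[100 + j], period * b_j / mu)
```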





5. Examples 

(a) Self-renewing Aggregates. We consider an electric bulb, fuse, or 
other piece of equipment with a finite life span. As soon as the piece 
fails, it is replaced by a new piece of like kind, which in due time is 
replaced by a third piece, and so on. We assume that the life span 
is a random variable which ranges only over multiples of a unit time 
interval (year, day, or second). Each time unit then represents a 
trial with possible outcomes “replacement” and “no replacement.” 
The successive replacements may be treated as recurrent events. If
$a_n$ is the probability that a new piece will serve for exactly $n$ time
units, then $\{a_n\}$ is the distribution of the recurrence times. If it is
certain that the life span is finite, then $\sum a_n = 1$ and the recurrent
event is certain. Usually it is known that the life span cannot exceed
a fixed number $m$, and in this case the generating function $A(s)$ is a
polynomial of a degree not exceeding $m$.

So far we have considered only a single piece and the line of its direct 
descendants (replacements). We now turn to the study of a whole 
population of pieces of like kind, each of which is replaced as soon as 
it expires, so that the replacements keep the population size constant. 
The term “self-renewing aggregates” is used to describe this situation 
(although the meaning of the prefix “self” is not apparent). Of course, 
in special cases a self-renewing aggregate may well consist of people. 

Suppose that the initial population (at time 0) contains exactly
$v_k$ elements (pieces or people) of age $k$, so that $N = \sum v_k$ is the original
population size. Each of these $N$ elements originates a line of descendants,
and at any time $n$ there is a certain probability that a replacement
is required in this line. The sum of these probabilities for all $N$ elements
is the expected number $u_n$ of replacements at time $n$. In the
present case it is natural to put $u_0 = 0$. Without loss of generality we
may assume that the life span is necessarily positive, that is, $a_0 = 0$.

To see that the renewal equation applies, note that the replacements
at time $n$ are of two kinds. First, an element to be replaced may have
been installed as a new element at time $j$ ($1 \le j < n$). At time $n$ such
an element has age $n - j$, and the expected number of replacements
of this kind is clearly $u_ja_{n-j}$. Adding over all possible $j$, we get
$u_1a_{n-1} + u_2a_{n-2} + \cdots + u_{n-1}a_1$ (remember that we have $u_0 = a_0 = 0$).
This accounts for the second term on the right in the renewal
equation (4.1). Second, the element to be replaced at time $n$ may be
of the initial population and have been of age $k \ge 0$ at time 0 (that is,
of age $n + k$ at the moment of expiration). The probability of a life
span exceeding $k$ is $r_k = a_{k+1} + a_{k+2} + \cdots$. To find the probability
that an element of age $k$ will expire after exactly $n$ years we
require the conditional probability of a life span $k + n$ on the hypothesis
that the life span exceeds $k$. This conditional probability is
obviously $a_{n+k}/r_k$. Now at time 0 there were $v_k$ elements of age $k$.
The expected number of those among them which expire at time $n$ is
$v_ka_{n+k}/r_k$. The expected number of elements of the original population
which expire at time $n \ge 1$ is therefore

(5.1)    $b_n = \sum_{k=0}^{\infty} \dfrac{v_ka_{n+k}}{r_k}.$

Adding this term to the one previously found, we see that for $n \ge 1$
$u_n$ satisfies the renewal equation (4.1). To have (4.1) true for all $n$
we put $b_0 = 0$. We can then apply our theorems⁴ to find the asymptotic
behavior of the renewal coefficients $u_n$.

It is also easy to obtain the age distribution at time $n$. Let $v_k(n)$
be the expected number of elements of age $k$ at time $n$ [which implies
$v_k(0) = v_k$]. Then clearly

(5.2)    $v_k(n) = u_{n-k}r_k$ if $k < n$

         $v_k(n) = \dfrac{v_{k-n}r_k}{r_{k-n}}$ if $k \ge n$.

In the non-periodic case we know that $u_n \to B(1)/\mu = N/\mu$ as $n \to \infty$,
and it follows from (5.2) that $v_k(n) \to Nr_k/\mu$. Hence, in the non-periodic
case, there is a stable limiting age distribution: in the limit the
expected number of elements of age $k$ is $Nr_k/\mu$, where $N$ is the (constant)
population size, and $\mu = \sum r_k$ the mean duration of life (if
$\mu = \infty$, then the population ages indefinitely). The basic fact is that
the limiting age distribution is independent of the initial age distribution
and depends only on the mortality distribution $\{a_n\}$ (cf. problems 9
and 10).

As a numerical illustration consider a population of 1000 elements
with the age distribution $v_0 = 500$, $v_1 = 320$, $v_2 = 74$, $v_3 = 100$,
$v_4 = 6$. Let the life-time distribution be given by $a_1 = 0.20$, $a_2 = 0.43$,
$a_3 = 0.17$, $a_4 = 0.17$, $a_5 = 0.03$ (no element can attain an age exceeding 5).
Here $r_0 = 1$, $r_1 = 0.80$, $r_2 = 0.37$, $r_3 = 0.20$, $r_4 = 0.03$,
$r_5 = 0$, whence $v_0/r_0 = 500$, $v_1/r_1 = 400$, $v_2/r_2 = 200$, $v_3/r_3 = 500$,

⁴ For further properties cf. W. Feller, Fluctuation Theory of Recurrent Events,
Transactions of the American Mathematical Society, vol. 67 (1949), pp. 98-119. A
great many papers treat special cases. For numerical applications cf. for example
N. R. Campbell, The Replacement of Perishable Members of a Continually Operating
System, Supplement, Journal of the Royal Statistical Society, vol. 7 (1941), pp.
110-130, or D. J. Bishop, The Renewal of Aircraft, Ministry of Aircraft Production,
Aeronautical Research Committee, Report and Memoranda no. 1907 (6342), 1942.





$v_4/r_4 = 200$. Hence from (5.1) $b_1 = 397$, $b_2 = 332$, $b_3 = 159$, $b_4 = 97$,
$b_5 = 15$, and from (4.3) we get

(5.3)    $U(s) = s\,\dfrac{397 + 332s + 159s^2 + 97s^3 + 15s^4}{1 - 0.20s - 0.43s^2 - 0.17s^3 - 0.17s^4 - 0.03s^5}.$

The roots of the denominator are $s_1 = 1$, $s_2 = -5/3$, $s_3 = -5$, $s_4 = 2i$,
$s_5 = -2i$, and hence

$U(s) = \dfrac{1250s}{3(1 - s)} - \dfrac{972s}{61(1 + 3s/5)} + \dfrac{38s}{87(1 + s/5)} - \dfrac{78{,}225s^2 + 22{,}125s}{5307(1 + s^2/4)}.$

Expanding each term into a geometric series, we get exact expressions
for $u_n$.

The age distributions $v_k(n)$ are given in the following table.

        n =   0      1      2      3      4      5      6      7      ∞
k = 0       500    397    411.4  412    423.8  414.3  417.0  416.0  416.7
k = 1       320    400    317.6  329.1  329.6  339.0  331.5  333.6  333.3
k = 2        74    148    185    146.9  152.2  152.4  156.8  153.3  154.2
k = 3       100     40     80    100     79.4   82.3   82.4   84.8   83.3
k = 4         6     15      6     12     15     11.9   12.3   12.4   12.5
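The numerical illustration can be retraced by machine. The following sketch computes $r_k$, the quantities $b_n$ of (5.1), the renewal coefficients $u_n$ from (4.1) with $u_0 = b_0 = 0$, and the age distribution (5.2); the variable and function names are assumptions made for the illustration.

```python
a = [0.0, 0.20, 0.43, 0.17, 0.17, 0.03]   # a[n]: probability of a life span of exactly n
v0 = [500.0, 320.0, 74.0, 100.0, 6.0]     # v0[k]: initial number of elements of age k
max_age = len(a) - 1

# r[k] = a_{k+1} + a_{k+2} + ... : probability of a life span exceeding k
r = [sum(a[k + 1:]) for k in range(max_age + 1)]


def b(n: int) -> float:
    """Expected number of original elements expiring at time n, equation (5.1)."""
    if n == 0:
        return 0.0
    return sum(v0[k] * a[n + k] / r[k] for k in range(len(v0)) if n + k <= max_age)


def renewals(n_max: int) -> list[float]:
    """u_0 = 0 and, for n >= 1, u_n = b_n + u_1 a_{n-1} + ... + u_{n-1} a_1."""
    u = [0.0]
    for n in range(1, n_max + 1):
        u.append(b(n) + sum(u[j] * a[n - j] for j in range(1, n) if n - j <= max_age))
    return u


def age_distribution(n: int, u: list[float]) -> list[float]:
    """Expected numbers v_k(n) for k = 0, ..., 4, from (5.2)."""
    ages = []
    for k in range(len(v0)):
        if k < n:
            ages.append(u[n - k] * r[k])
        else:
            ages.append(v0[k - n] * r[k] / r[k - n])
    return ages


u = renewals(60)
for n in range(8):
    print(n, [round(value, 1) for value in age_distribution(n, u)])
print("limit of u_n:", sum(v0) / sum(r))   # N / mu with mu = r_0 + r_1 + ...
```

Each printed row gives the ages $k = 0, \ldots, 4$ at one time $n$, that is, one column of the table; the entries of every row add up to the constant population size.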


(b) Population Theory. This theory is analogous to renewal theory
except that the population size is variable and that female births play
the role of replacements. The essential novelty is that a mother can
have zero, one, or more daughters, so that lines may become extinct
or branch. We now define $a_n$ as the probability of a newborn female
to survive and at age $n$ give birth to a female child (the dependence
on the number and ages of previous children is neglected). Then $\sum a_n$
is the expected number of daughters, and hence all three possibilities
$\sum a_n < 1$, $\sum a_n = 1$, $\sum a_n > 1$ are now possible. The preceding argument
applies with this obvious modification.

6. Problems for Solution 

1. Find an approximation to the probability that in 10,000 tossings of a coin 
the number of head runs of length 3 will lie between 700 and 730. 

2. In a sequence of tossings of a coin let $\mathcal{E}$ stand for the pattern $HTH$. Let $r_n$
be the probability that $\mathcal{E}$ does not occur in $n$ trials. Verify the relation (7.15) and
hence (7.14) of Chapter 11.





3. In example (2.b) show that the expected duration of the game is $\mu_1\mu_2/(\mu_1 + \mu_2)$,
where $\mu_1$ and $\mu_2$ are the mean recurrence times for success runs of length $r$ and failure
runs of length $\rho$, respectively.

4. The possible outcomes of each trial are $A$, $B$, and $C$; the corresponding
probabilities are $\alpha$, $\beta$, $\gamma$ ($\alpha + \beta + \gamma = 1$). Find the generating function of the
probability that in $n$ trials there is no run of length $r$: (a) of $A$'s, (b) of $A$'s or $B$'s,
(c) of any kind.

5. Continuation. Find the probability that the first $A$-run of length $r$ precedes
the first $B$-run of length $\rho$ and terminates at the $n$th trial. [Note that this problem
does not reduce to that of example (2.b) with $p = \alpha/(\alpha + \beta)$, $q = \beta/(\alpha + \beta)$.]

6. In a sequence of Bernoulli trials let $q_{k,n}$ be the probability that exactly $n$
success runs of length $r$ occur in $k$ trials. Using problem 9 of Chapter 12 show that
the generating function $Q_k(x) = \sum q_{k,n}x^n$ is the coefficient of $s^k$ in

$\dfrac{1 - p^rs^r}{1 - s + qp^rs^{r+1} - (1 - ps)p^rs^rx}.$

Show, furthermore, that the root of the denominator which is smallest in absolute
value is $s_1 \approx 1 + qp^r(1 - x)$.

7. Continuation. The Poisson distribution of long runs.⁵ If the number $k$
of trials and the length $r$ of runs both tend to infinity, so that $kqp^r \to \lambda$, then the
probability of having exactly $n$ runs of length $r$ tends to $e^{-\lambda}\lambda^n/n!$.

Hint: Using the preceding problem, show that the generating function is
asymptotically $\{1 + qp^r(1 - x)\}^{-k}$.
The following problems refer to the renewal theory, specifically to example (5.a).

8. Constancy of the population. For the quantities (5.2) prove by induction
that $\sum_k v_k(n) = N$ for every $n$.

9. If the mortality distribution is given by pk = (with p + q = 1), find 

$u_n$ and the limiting age distribution, assuming that the original population consists
of $N$ elements aged zero.

10. An age distribution is called stable if $v_k(n)$ does not depend on $n$. Show that
this is the case if, and only if, $v_k = Cr_k$, where $C$ is a constant.

⁵ This problem is best solved using the continuity theorem of Chapter 11, section 8.
The theorem was proved by different methods by von Mises.



CHAPTER 14 


RANDOM WALK AND RUIN PROBLEMS 

1. General Orientation 

The main part of this chapter is devoted to certain problems connected
with Bernoulli trials in which the probabilities of success and
failure are p and q, respectively. For simplicity and clarity of language
we shall formulate the problems and theorems in terms of two intuitive
models.

First, we shall consider the familiar gambler who wins a dollar for 
each success and loses a dollar for each failure. We shall suppose that 
the gambler and his adversary own a total of a dollars and start with 
z and a — z dollars, respectively. The game continues until the 
gambler's capital either is reduced to zero or has increased to a, that 
is, until one of the two players is ruined. We are interested in the 
probability of the gambler’s ruin and the probability distribution of 
the duration of the game. This is the classical ruin problem. 

Physical applications and analogies suggest another intuitive inter¬ 
pretation. We imagine that the trials are performed at times t = 1, 

2, 3, ... and interpret their results in terms of the motion of a variable 
point or particle on the x-axis. At time t = 0 this particle has the 
position x = z, and at times t = 1, 2, 3, ... it moves a unit step to the 
right or left according to whether the corresponding trial results in 
success or failure. Thus the position of our particle at time n represents 
the gambler’s capital at the conclusion of the nth trial. The trials 
terminate when the particle for the first time reaches either x = 0 or 
x = a. We say that the particle performs a random walk. The limiting 
positions x = 0 and x = a are called absorbing barriers; the gambler’s
ruin is interpreted as absorption at x = 0. We say that our random
walk is restricted to the possible positions x = 0, 1, ..., a; in the absence
of absorbing barriers the random walk is called unrestricted. If
p = q = 1/2, the random walk is called symmetric. Physicists use
the random-walk model as a crude approximation to one-dimensional 
diffusion processes and Brownian motion, where a physical particle is 
exposed to a great number of molecular collisions or shocks which 
impart to it a random motion. If there is a drift to the right, shocks 
from the left are more probable and we have p > q. 
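
The restricted random walk just described is easy to simulate. The following Python sketch is only an illustration: the function name simulate_walk, the fixed seed, and the sample values z = 3, a = 10, p = 1/2 are assumptions made for this example and do not come from the text. The particle moves one unit to the right with probability p and one unit to the left otherwise, and the walk stops at the first visit to 0 or a.

```python
import random

def simulate_walk(z, a, p, rng):
    """Run one restricted random walk; return (absorbing position, steps)."""
    if not 0 < z < a:
        raise ValueError("need 0 < z < a")
    position, steps = z, 0
    while 0 < position < a:
        # A success (probability p) moves the particle right, a failure left.
        position += 1 if rng.random() < p else -1
        steps += 1
    return position, steps

rng = random.Random(0)  # fixed seed (an arbitrary choice) so runs are repeatable
runs = [simulate_walk(3, 10, 0.5, rng) for _ in range(2000)]
ruin_fraction = sum(1 for pos, _ in runs if pos == 0) / len(runs)
mean_steps = sum(steps for _, steps in runs) / len(runs)
print("fraction absorbed at 0:", ruin_fraction)
print("average number of steps:", mean_steps)
```

The observed fraction of walks absorbed at x = 0 is an empirical estimate of the gambler's probability of ruin, and the average number of steps estimates the expected duration of the game.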



So far we have merely described the classical ruin problem in a 
new terminology. However, the random-walk model leads to new problems
which are also suggested by physical analogies. Thus, instead of
absorbing barriers we may consider other boundary conditions. For
example, we may imagine a reflecting wall at x = 1/2 with the property
that if the particle starts from x = 1 and moves to the left, it is reflected
at x = 1/2 and returns to x = 1 instead of reaching x = 0. In
other words, whenever the particle is at the position x = 1, then it
has probability p to move a unit step to the right and probability q to
stay. We describe this condition by referring to a reflecting barrier
at x = 1/2. In gambling terminology this corresponds to a convention
that whenever the gambler loses his last dollar it is generously replaced
by his adversary so that the game can continue. A reflecting barrier
at x = a − 1/2 is defined in a similar way. With two reflecting barriers
the random walk never terminates. We may also consider random 
walks for which one boundary acts as a reflecting, the other as an 
absorbing barrier. Finally, there exist elastic barriers, which are 
partly absorbing, partly reflecting. 
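
The reflecting-barrier rule can be written as a one-step transition. This is a minimal Python sketch; the name reflected_step and the sample sequence of outcomes are hypothetical, and the treatment of the upper barrier assumes that "defined in a similar way" means a step to the right from a − 1 leaves the particle at a − 1.

```python
def reflected_step(x, success, a=None):
    """Next position of a walk with a reflecting barrier at 1/2.

    If `a` is given, a second reflecting barrier is assumed at a - 1/2.
    """
    if success:
        # A step to the right from a - 1 is reflected back to a - 1.
        if a is not None and x == a - 1:
            return x
        return x + 1
    # A step to the left from 1 is reflected back to 1.
    if x == 1:
        return x
    return x - 1

# Walk confined to the positions 1, 2, 3 by barriers at 1/2 and 7/2 (a = 4).
path = [2]
for outcome in [False, False, True, True, True]:
    path.append(reflected_step(path[-1], outcome, a=4))
print(path)
```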

Consider next an unrestricted random walk with the possible positions
x = 0, ±1, ±2, .... We may inquire as to the probability that
the particle eventually returns to its initial position and, if it does,
that the first return occurs at the nth step. This is the problem of
recurrence times which was solved and discussed at length in Chapter 12
[example (3.a) and section 5]. We found there that the fluctuations
of these recurrence times exhibit several unexpected properties and 
differ from the more familiar type of fluctuation phenomena described 
by the central limit theorem. 

Instead of inquiring as to the return to the initial position we may 
also ask for the probability that the particle reaches a preassigned 
position x for the first time at the nth step. This is the problem of 
first passage times, which is related to the ruin problem. In fact, suppose
that a gambler starts with an initial capital z > 0 and plays against
an infinitely rich adversary. (This is the limiting case of the classical
ruin problem when a → ∞.) The gambler's capital is represented by a
particle performing a random walk, and the gambler is ruined when
the particle reaches the position x = 0 for the first time. The problem
of the duration of this game is equivalent to the problem of the
first passage time through the origin in an unrestricted random walk 
starting at z. 

Even though we have used various intuitive descriptions, all problems
described are obviously concerned with sums of mutually independent
random variables $X_1, X_2, \ldots$ which assume the values +1
and −1 with probabilities p and q, respectively. Among the many
generalizations of random-walk problems we shall here consider only 
two. 

First, instead of Bernoulli trials we may consider arbitrary trials.
This means that the gambler’s gain $X_k$ at the kth trial is now a random
variable with an arbitrary distribution, or, in random-walk terminology,
that the particle may change its position in jumps which are
not necessarily of magnitude ±1. We formulate the corresponding
ruin or absorption problem as follows. The particle starts at z > 0
and the process ends when for the first time it jumps to a position
x ≤ 0 or x ≥ a, that is, when for the first time¹ the sum $X_1 + \cdots + X_k$
is either ≤ −z or ≥ a − z. Required are the corresponding probabilities
and the probability distribution of the duration of the game.
This problem has attracted widespread interest in connection with
sequential sampling. There the $X_k$ represent certain characteristics of
samples or observations. Measurements are taken until a sum
$X_1 + \cdots + X_k$ falls outside two preassigned limits (our −z and
a − z). In the first case the procedure leads to what is technically
known as rejection, in the second case to acceptance. The first sampling
procedure of this kind was described by W. Bartky²; the general
theory was outlined by A. Wald, to whom the above formulation is
due.³ In section 8 the methods of ordinary random walks are adapted
to this more general case. However, it is more natural to consider the
generalized problem as a special case of Markov chains, and a full
treatment is postponed to Chapter 15.⁴ It must be understood that
all our random walks can be treated as special Markov chains, and
that the present chapter serves mostly as an introduction to the next.

A second generalization consists in letting the particle perform a
random walk in two or more dimensions. For example, in two dimensions
we consider the regular net formed by the lines x = 0, ±1,
±2, ... and y = 0, ±1, ±2, .... The particle moves in unit steps, but
from each position it has the choice of four possible directions. An
interesting difference between random walks in two and in three
dimensions will be discussed in section 7.

1 For an interpretation in betting language it is, of course, necessary that the
gambler’s initial capital be at least z plus the maximum possible loss in a single
trial; similarly the adversary’s initial capital must be at least a plus his maximum
single loss.

2 W. Bartky, Multiple Sampling with Constant Probability, Annals of Mathematical
Statistics, vol. 14 (1943), pp. 363-377.

3 A. Wald, On Cumulative Sums of Random Variables, Annals of Mathematical
Statistics, vol. 15 (1944), pp. 283-296. The methods described in the present book
are different from Wald’s. Cf. also Wald’s book, Sequential Analysis, John Wiley
& Sons, New York, 1947.

4 In the theory of sequential sampling it has become usual to consider random
walks in which the particle has probability p to move in the direction of the positive
x-axis, and probability q to move in the direction of the positive y-axis. Most
of the qualitative results in this direction follow from the results of Chapter 15,
section 2, example IX.

The discussion of the various problems will proceed as follows. In 
section 2 the probability of the gambler's ruin is derived and various 
implications of the solution are discussed. In section 3 it is shown 
how the expected value of the duration of the game can be derived in 
an elementary way. In sections 4 and 5 we turn to the more delicate 
problem of the probability distributions of the duration of the game 
and of first passage times. We use the method of difference equations 
because of its intrinsic interest and because of its intimate connections 
with physical diffusion theory. An alternative derivation of the results 
is outlined in problems 8-13. The random walk with reflecting barriers 
is not considered in this chapter but will be treated by the more 
appropriate methods of Markov chains (Chapter 16, section 3; cf. also 
problem 13). 

In section 6 we pass to the limit of a continuous chance process and 
discuss the connection of random walks with diffusion theory. In 
section 7 random walks in the plane and space are considered. Finally, 
section 8 is devoted to the generalized random walk connected with 
sequential sampling. 

2. The Gambler’s Ruin 

We consider the problem stated at the opening of the present chapter.
Let $q_z$ be the probability of the gambler's ultimate⁵ ruin, and $p_z$
the probability of his winning. In random-walk terminology $q_z$ and $p_z$
are the probabilities that a particle starting at z will be absorbed at
x = 0 and x = a, respectively. We shall show that $p_z + q_z = 1$, so
that we need not consider the possibility of an unending game.

After the first trial the gambler's fortune is either z − 1 or z + 1,
and therefore we must have

(2.1)    $q_z = pq_{z+1} + qq_{z-1}$

provided 1 < z < a − 1. For z = 1 the first trial may lead to ruin,

5 Strictly speaking, the probability of ruin is defined in a sample space of infinitely
prolonged games. However, we can work with the sample space of n trials.
The probability of ruin in less than n trials increases with n and has therefore a
limit. We call this limit “the probability of ruin.” All probabilities in this chapter
may be interpreted in this way without referring explicitly to infinite sample spaces
(cf. the introduction to Chapter 8).





and (2.1) is to be replaced by $q_1 = pq_2 + q$. Similarly, for z = a − 1
the first trial may result in victory, and therefore $q_{a-1} = qq_{a-2}$. To
unify our equations we shall define

(2.2)    $q_0 = 1, \quad q_a = 0.$

With this convention the probability $q_z$ of ruin satisfies (2.1) for
z = 1, 2, ..., a − 1.

Equation (2.1) is a difference equation, and (2.2) represents the boundary
conditions on $q_z$. We shall derive an explicit expression for $q_z$ by the
method of particular solutions, which also will be used in more general
cases.

Suppose first that p ≠ q. It is easily verified that the difference
equation (2.1) admits of the two particular solutions $q_z = 1$ and
$q_z = (q/p)^z$. It follows that for arbitrary constants A and B the
sequence

(2.3)    $q_z = A + B\left(\frac{q}{p}\right)^z$

represents a formal solution of (2.1). We wish to adjust the constants
A and B so that the boundary conditions (2.2) will be satisfied. This
means that A and B must satisfy the two linear equations A + B = 1
and $A + B(q/p)^a = 0$. Thus

(2.4)    $q_z = \frac{(q/p)^a - (q/p)^z}{(q/p)^a - 1}$

is a formal solution of the difference equation (2.1), satisfying the
boundary conditions (2.2). In order to prove that (2.4) is the required
probability of ruin it remains to show that the solution is unique. In
other words, we have to prove that all solutions of (2.1) can be written
in the form (2.3). Now, given an arbitrary solution of (2.1), the two
constants A and B can be chosen so that (2.3) will agree with it for
z = 0 and z = 1. However, from these two values all other values
can be found by substituting in (2.1) successively z = 1, 2, 3, .... This
means that two solutions which agree for z = 0 and z = 1 are identical,
and hence that every solution is of the form (2.3).

Our argument breaks down if p = q = 1/2, for then (2.4) is meaningless.
This is due to the fact that in this case the two formal particular
solutions $q_z = 1$ and $q_z = (q/p)^z$ are identical. However, we now have a
second formal solution in $q_z = z$, and therefore $q_z = A + Bz$ is a
solution of (2.1) depending on two constants. In order to satisfy the
boundary conditions (2.2) we must put A = 1 and A + Ba = 0.
Hence

(2.5)    $q_z = 1 - \frac{z}{a}.$

[The same numerical value can be obtained formally from (2.4) by
finding the limit as p → 1/2, using L’Hospital’s rule.]

We have thus proved that the required probability of the gambler's
ruin is given by (2.4) if p ≠ q, and by (2.5) if p = q = 1/2. The
probability $p_z$ of the gambler's winning the game equals the probability
of his adversary’s ruin, and is therefore obtained from our formulas on
replacing p, q, and z by q, p, and a − z, respectively. It is readily seen
that $p_z + q_z = 1$, as stated previously.
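
Formulas (2.4) and (2.5) translate directly into a short function. The Python sketch below is illustrative only: the name ruin_probability and the sample values a = 10, p = 0.45 are assumptions for the example. It evaluates $q_z$ and then measures how closely the computed values satisfy the difference equation (2.1).

```python
def ruin_probability(z, a, p):
    """Probability q_z of absorption at 0 for a walk starting at z, 0 <= z <= a."""
    q = 1.0 - p
    if p == q:                           # symmetric case, formula (2.5)
        return 1.0 - z / a
    r = q / p
    return (r**a - r**z) / (r**a - 1.0)  # formula (2.4)

a, p = 10, 0.45
q = 1.0 - p
values = [ruin_probability(z, a, p) for z in range(a + 1)]

# Largest violation of q_z = p*q_{z+1} + q*q_{z-1} over 0 < z < a.
residual = max(
    abs(values[z] - (p * values[z + 1] + q * values[z - 1]))
    for z in range(1, a)
)
print("q_0, q_a:", values[0], values[a])
print("largest residual in (2.1):", residual)
```

The residual should be of the order of floating-point rounding error. For values of q/p far from 1 combined with a very large a, the powers in (2.4) can overflow, so a production version would need a rescaled form of the formula.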

We can reformulate our result as follows: Let a gambler with an 
initial capital z play against an infinitely rich adversary who is always 
willing to play, while the gambler has the privilege of stopping at his 
pleasure. The gambler adopts the strategy of playing until he either loses 
his capital or increases it to a (for a net gain a − z). Then $q_z$ is the
probability of his losing and $1 - q_z$ the probability of his winning.

Under this system the gambler’s ultimate gain or loss is a random 
variable G which assumes the values a — z and —z with probabilities 
$1 - q_z$ and $q_z$, respectively. The expectation of gain is found to be

(2.6)    $E(G) = a(1 - q_z) - z.$

Introducing the value $q_z$ from (2.5), it is found that, if p = q = 1/2,
then E(G) = 0. Conversely, it follows from (2.6) that E(G) = 0
implies (2.5) and hence that p = q. This means that, with the system 
described, a “fair” game remains fair, and no “unfair” game can be 
changed into a “fair” one. 

From (2.5) we see that in the case p = q a player with initial capital
z = 999 has a probability 999/1000 to win a dollar before losing his
capital. With q = 0.6, p = 0.4 the game is unfavorable indeed, but
still the probability (2.4) of winning a dollar before losing the capital
is about 2/3. In general, a gambler with a relatively large initial
capital z has a reasonable chance to win a small amount a − z before
being ruined.⁶
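
The two numerical cases can be checked directly. In this Python sketch the names win_probability and expected_gain are hypothetical; the first evaluates $1 - q_z$ from (2.4) or (2.5), written as a single fraction, and the second evaluates (2.6).

```python
def win_probability(z, a, p):
    """Probability 1 - q_z of reaching a before 0, starting from z."""
    q = 1.0 - p
    if p == q:
        return z / a                     # from (2.5)
    r = q / p
    return (r**z - 1.0) / (r**a - 1.0)   # 1 - q_z with q_z taken from (2.4)

def expected_gain(z, a, p):
    """Expected gain (2.6): a(1 - q_z) - z."""
    return a * win_probability(z, a, p) - z

fair = win_probability(999, 1000, 0.5)
unfavorable = win_probability(999, 1000, 0.4)
print("p = 0.5:", fair)
print("p = 0.4:", unfavorable)
print("expected gain, p = 0.4:", expected_gain(999, 1000, 0.4))
```

With p = 0.4 the chance of winning the single dollar is still close to 2/3, but the expected gain is strongly negative because the rare loss costs the whole capital.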

6 A certain man used to visit Monte Carlo year after year and was always successful
in recovering the costs of his vacations. He firmly believed in a magic
power over chance. Actually his experience is not surprising. Assuming that he
started with ten times the ultimate gain, the chances of success are nearly 9/10.
The probability of an unbroken sequence of ten successes is about
$(1 - 1/10)^{10} \approx e^{-1} \approx 0.37$. One failure would, of course, be blamed on an
oversight or momentary indisposition.





Let us now consider the effect of changing stakes. If the initial 
capital of the player and of his adversary are z and a — z, respectively, 
then the probability of the player’s ruin is given by (2.4). Suppose 
now that the unit is changed from a dollar to a half-dollar. This means 
simply that in (2.4) we must replace z by 2z and a by 2a. The new
probability of ruin is therefore

(2.7)    $q_z^* = \frac{(q/p)^{2a} - (q/p)^{2z}}{(q/p)^{2a} - 1} = q_z \cdot \frac{(q/p)^a + (q/p)^z}{(q/p)^a + 1}.$

If q > p, then the last fraction is greater than unity and hence
$q_z^* > q_z$. Hence, if the stakes are doubled while the initial capitals remain
unchanged, then the probability of ruin decreases for the player whose
probability of success p < 1/2, and increases for the adversary (for whom
the game is advantageous since q > p). A similar statement holds true
in general when the stakes are increased (not necessarily doubled).
Suppose, for example, that our player plays on the unfavorable side
(q > p) and owns 900 dollars which he is willing to risk in order to win
100 dollars. If he stakes 1 dollar at each trial, the probability of ruin
is given by (2.4) with z = 900, a = 1000. If he stakes 10 dollars at
each trial, we must put z = 90 and a = 100. In general, if k dollars 
are staked at each trial, we find the probability of ruin from (2.4), 
replacing z by z/k and a by a/k, and the probability of ruin decreases 
as k increases. The gambler therefore minimizes the probability of 
ruin by selecting the stakes as large as is consistent with his goal of 
gaining an amount fixed in advance. In this sense, the popular doubling 
system is optimal. In fact, suppose a player sets out to win an amount 
c (which should be reasonably small in comparison with his initial 
capital). The optimal stake at the first trial is c, for we have just shown 
that a smaller stake would increase the probability of ruin, and the 
same is true of a larger stake since the possibility of a larger gain is 
necessarily compensated by an increased probability of ruin. If the 
first trial is successful, then his goal is achieved and he leaves the game. 
Otherwise, he has to recover the loss c and win additional c dollars. 
His new goal is therefore 2c, and hence the doubling of the stake.
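
The effect of the stake size can be tabulated numerically. The following Python sketch is illustrative: the function name ruin_probability_with_stake is hypothetical, and the value p = 0.45 is an assumed example of an unfavorable game (the text fixes only q > p). The capital of 900 dollars and the goal of 1000 dollars are those of the example above.

```python
def ruin_probability_with_stake(z, a, p, k):
    """Ruin probability from (2.4)/(2.5) when k dollars are staked per trial.

    Staking k dollars is the same as measuring z and a in units of k,
    so k is required to divide both z and a.
    """
    if z % k or a % k:
        raise ValueError("k must divide both z and a")
    q = 1.0 - p
    units_z, units_a = z // k, a // k
    if p == q:
        return 1.0 - units_z / units_a
    r = q / p
    return (r**units_a - r**units_z) / (r**units_a - 1.0)

stakes = [1, 10, 100]
ruin = [ruin_probability_with_stake(900, 1000, 0.45, k) for k in stakes]
for k, value in zip(stakes, ruin):
    print("stake", k, "-> probability of ruin", value)
```

Under these assumptions the probability of ruin falls from nearly one at unit stakes to roughly one in five at stakes of 100 dollars, in line with the statement that larger stakes favor the player on the unfavorable side.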

These results are classical. It has been contended that every “unfair”
bet is unreasonable. If this were to be taken seriously, it would
mean the end of all insurance business. Actually no theorem of probability
suggests that a careful driver who insures against liability at
average rates acts unreasonably, but he plays a game which is technically
“unfair.”

If in our formulas we pass to the limit as a → ∞, we expect to get
the probability of ruin in a game against an infinitely rich adversary.
With an initial capital z this probability should be 1 if q ≥ p and
$(q/p)^z$ if q < p. Note, however, that the case a = ∞ (random walk
on a semi-infinite line) is defined on its own merits and not as a limiting
case. It will be seen in section 4 that the result of a direct treatment
agrees with the described formal passage to the limit.

3. Expected Duration of the Game 

The probability distribution of the duration of the game will be 
deduced in the following sections. However, its expected value can 
be derived by a much simpler method which is of such wide applicability
that it will now be explained at the cost of a slight duplication.

We are still concerned with the classical ruin problem formulated
at the beginning of this chapter. We shall assume as known that the
duration of the game has a finite expectation $D_z$. A rigorous proof will
be given in the next section. 

The argument which led to the difference equation (2.1) and the 
boundary conditions (2.2) shows directly that the expected duration 
$D_z$ satisfies the difference equation

(3.1)    $D_z = pD_{z+1} + qD_{z-1} + 1, \qquad 0 < z < a,$

with the boundary conditions

(3.2)    $D_0 = 0, \quad D_a = 0.$

The appearance of the term 1 makes the difference equation (3.1)
non-homogeneous. If p ≠ q, then $D_z = z/(q - p)$ is a formal solution
of (3.1). It is readily seen that the difference $\Delta_z$ of any two solutions
of (3.1) satisfies the homogeneous equation $\Delta_z = p\Delta_{z+1} + q\Delta_{z-1}$,
and we know already that all solutions of this equation are of the
form $A + B(q/p)^z$. It follows that if p ≠ q all solutions of (3.1) are
of the form

(3.3)    $D_z = \frac{z}{q - p} + A + B\left(\frac{q}{p}\right)^z.$

The values of the constants A and B follow again from the boundary
conditions (3.2), according to which we must have A + B = 0 and
$A + B(q/p)^a = -a/(q - p)$. Solving for A and B, we find

(3.4)    $D_z = \frac{z}{q - p} - \frac{a}{q - p} \cdot \frac{1 - (q/p)^z}{1 - (q/p)^a}.$

Again the formula breaks down if q = p = 1/2. In this case we must
replace z/(q − p) by $-z^2$, which is now a solution of (3.1). It follows
that when p = q = 1/2 all solutions of (3.1) are of the form
$D_z = -z^2 + A + Bz$. The required solution $D_z$ which satisfies the
boundary conditions (3.2) is then

(3.5)    $D_z = z(a - z).$

The expected duration of the game in the classical ruin problem is given
by (3.4) or (3.5), according as p ≠ q or p = q = 1/2.

It should be noted that this duration is considerably longer than 
one would naively expect. If two players with 500 dollars each toss a 
coin until one is ruined, the average duration of the game is 250,000 
trials. If a gambler has only one dollar and his adversary 1000, the 
average duration is 1000 trials. 
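
These figures follow at once from (3.5), and (3.4) can be evaluated in the same way. The Python sketch below is illustrative; the name expected_duration and the test case a = 10, p = 0.45 are assumptions for the example. It also measures how well the computed values satisfy the difference equation (3.1).

```python
def expected_duration(z, a, p):
    """Expected duration D_z of the game, from (3.4) or (3.5)."""
    q = 1.0 - p
    if p == q:
        return float(z * (a - z))                                   # (3.5)
    r = q / p
    return z / (q - p) - a / (q - p) * (1.0 - r**z) / (1.0 - r**a)  # (3.4)

print(expected_duration(500, 1000, 0.5))   # two players with 500 dollars each
print(expected_duration(1, 1001, 0.5))     # one dollar against 1000

# Largest violation of D_z = p*D_{z+1} + q*D_{z-1} + 1 for a = 10, p = 0.45.
a, p = 10, 0.45
q = 1.0 - p
d = [expected_duration(z, a, p) for z in range(a + 1)]
residual = max(abs(d[z] - (p * d[z + 1] + q * d[z - 1] + 1.0)) for z in range(1, a))
print("largest residual in (3.1):", residual)
```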


TABLE 1

Illustrating the Classical Ruin Problem

                                    Probability of        Expectation of
    p       q        z         a    Ruin     Success      Gain       Duration
  0.5     0.5        9        10    0.1      0.9          0                 9
  0.5     0.5       90       100    0.1      0.9          0               900
  0.5     0.5      900     1,000    0.1      0.9          0            90,000
  0.5     0.5      950     1,000    0.05     0.95         0            47,500
  0.5     0.5    8,000    10,000    0.2      0.8          0        16,000,000
  0.45    0.55       9        10    0.210    0.790       -1.1              11
  0.45    0.55      90       100    0.866    0.134      -76.6           765.6
  0.45    0.55      99       100    0.182    0.818      -17.2           171.8
  0.4     0.6       90       100    0.983    0.017      -88.3           441.3
  0.4     0.6       99       100    0.333    0.667      -32.3           161.7

The initial capital is z. The game terminates with ruin (loss z) or capital a
(gain a − z).


Most interesting is the passage to the limit as a → ∞, which corresponds
to a play against an infinitely rich adversary (cf. the concluding
remark to section 2). From (2.4) and (2.5) we concluded that there
is probability one of ruin if q > p, while if q < p (favorable game),
the probability of ruin is $(q/p)^z$. If p = q the duration of the game
has infinite expectation. This is in accordance with the fact discussed
in Chapter 12 that, if a coin is tossed until for the first time the number
of heads equals the number of tails, the game has infinite expected
duration. For an unrestricted symmetric random walk this means
that the time until the first return to the initial position has infinite
expectation. Our new result states that the first passage time to any
position (even the adjacent ones) has infinite expectation. We shall
see that a similar statement is true in the more refined diffusion theory. 
The reader is referred to Chapter 12 (section 5) for a discussion of 
several startling features of recurrence times, in particular the arc 
sine law. 

4. Generating Functions for the Duration of the Game and First 
Passage Times 

We shall use the method of generating functions to study the duration
of the game in the classical ruin problem or restricted random
walk with absorbing barriers at x = 0 and x = a. The initial position
is z (with 0 < z < a). Let $u_{z,n}$ denote the probability that the process
ends with the nth step at the barrier x = 0 (gambler's ruin at the nth
trial). After the first step the position is z + 1 or z − 1, and we conclude
that for 1 < z < a − 1 and n ≥ 1

(4.1)    $u_{z,n+1} = pu_{z+1,n} + qu_{z-1,n}.$

This is a difference equation analogous to (2.1), but depending on the
two variables z and n. In analogy with the procedure of section 2 we
wish to define boundary values $u_{0,n}$, $u_{a,n}$, and $u_{z,0}$ so that (4.1) becomes
valid also for z = 1, z = a − 1, and n = 0. For this purpose we put

(4.2)    $u_{0,n} = u_{a,n} = 0$    when    n ≥ 1

and

(4.3)    $u_{0,0} = 1, \quad u_{z,0} = 0$    when    z > 0.

Then (4.1) holds for all z with 0 < z < a and all n ≥ 0.
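
The recurrence (4.1) with the boundary values (4.2) and (4.3) can be iterated numerically. The Python sketch below is illustrative; the name ruin_time_distribution and the small case z = 1, a = 3, p = 1/2 are assumptions chosen so that the result is easy to check by hand.

```python
def ruin_time_distribution(z, a, p, n_max):
    """Return [u_{z,0}, ..., u_{z,n_max}] by iterating (4.1)."""
    q = 1.0 - p
    # column[x] holds u_{x,n} for the current n; n = 0 is given by (4.3).
    column = [0.0] * (a + 1)
    column[0] = 1.0
    result = [column[z]]
    for _ in range(n_max):
        nxt = [0.0] * (a + 1)          # u_{0,n+1} = u_{a,n+1} = 0 by (4.2)
        for x in range(1, a):
            nxt[x] = p * column[x + 1] + q * column[x - 1]
        column = nxt
        result.append(column[z])
    return result

dist = ruin_time_distribution(1, 3, 0.5, 60)
print(dist[:6])
print("total probability of ruin:", sum(dist))
```

For this case ruin can occur only at odd trials, and the partial sums approach the value 2/3 given by (2.5).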

We now introduce the generating function

(4.4)    $U_z(s) = \sum_{n=0}^{\infty} u_{z,n}s^n.$

Multiplying (4.1) by $s^{n+1}$ and adding for n = 0, 1, 2, ..., we find for
0 < z < a

(4.5)    $U_z(s) = psU_{z+1}(s) + qsU_{z-1}(s).$

Moreover, equations (4.2) and (4.3) lead to the boundary conditions

(4.6)    $U_0(s) = 1, \quad U_a(s) = 0.$

Equation (4.5) is a difference equation analogous to (2.1), and the 
boundary conditions (4.6) correspond to (2.2). The novelty lies in 
the circumstance that the coefficients and the unknown $U_z(s)$ now
depend on the variable s, but as far as the difference equation is concerned,
s is merely an arbitrary constant. We can again apply the
method of section 2 provided we succeed in finding two particular
solutions of (4.5). It is natural to inquire whether there exist two
solutions $U_z(s)$ of the form $U_z(s) = \lambda^z(s)$. Substituting this expression
into (4.5), we find that $\lambda(s)$ must satisfy the quadratic equation

(4.7)    $\lambda(s) = ps\lambda^2(s) + qs,$

which has the two roots

(4.8)    $\lambda_1(s) = \frac{1 + (1 - 4pqs^2)^{1/2}}{2ps}, \qquad \lambda_2(s) = \frac{1 - (1 - 4pqs^2)^{1/2}}{2ps}$

(we take 0 < s < 1 and the positive square root).

We have thus found two particular solutions of (4.5) and conclude
as in section 2 that for two arbitrary functions A(s) and B(s)

(4.9)    $U_z(s) = A(s)\lambda_1^z(s) + B(s)\lambda_2^z(s)$

is a solution of (4.5). If this solution is to satisfy the boundary
conditions (4.6), we must have A(s) + B(s) = 1 and
$A(s)\lambda_1^a(s) + B(s)\lambda_2^a(s) = 0$. We find in this way

(4.10)    $U_z(s) = \frac{\lambda_1^a(s)\lambda_2^z(s) - \lambda_1^z(s)\lambda_2^a(s)}{\lambda_1^a(s) - \lambda_2^a(s)}.$

Using the obvious relation $\lambda_1(s)\lambda_2(s) = q/p$, the last formula simplifies
to

(4.11)    $U_z(s) = \left(\frac{q}{p}\right)^z \frac{\lambda_1^{a-z}(s) - \lambda_2^{a-z}(s)}{\lambda_1^a(s) - \lambda_2^a(s)}.$

This is the required generating function of the probability of ruin at the
nth trial (absorption at x = 0). The corresponding generating function
for the probability of absorption at x = a is obtained on replacing
p, q, z by q, p, and a − z, respectively. The generating function of the
duration of game is, of course, the sum of the two generating functions.
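
The closed form (4.11) can be compared numerically with the power series (4.4). In the Python sketch below the function names, the truncation point n_max, and the sample arguments are assumptions for illustration; the first function evaluates (4.11) through the roots (4.8), and the second sums the series with coefficients generated by (4.1) to (4.3).

```python
import math

def ruin_gf_closed_form(z, a, p, s):
    """U_z(s) from (4.11), for 0 < s < 1 and 0 < z < a."""
    q = 1.0 - p
    root = math.sqrt(1.0 - 4.0 * p * q * s * s)
    lam1 = (1.0 + root) / (2.0 * p * s)
    lam2 = (1.0 - root) / (2.0 * p * s)
    return (q / p) ** z * (lam1 ** (a - z) - lam2 ** (a - z)) / (lam1**a - lam2**a)

def ruin_gf_series(z, a, p, s, n_max=2000):
    """Partial sum of (4.4) for 0 < z < a, using the recurrence (4.1)."""
    q = 1.0 - p
    column = [1.0] + [0.0] * a           # u_{x,0} for x = 0, ..., a
    total, power = 0.0, 1.0              # power holds s**n; u_{z,0} = 0
    for _ in range(n_max):
        nxt = [0.0] * (a + 1)
        for x in range(1, a):
            nxt[x] = p * column[x + 1] + q * column[x - 1]
        column, power = nxt, power * s
        total += column[z] * power
    return total

print(ruin_gf_closed_form(3, 7, 0.45, 0.9))
print(ruin_gf_series(3, 7, 0.45, 0.9))
```

The two printed values should agree to within rounding error, since for s < 1 the neglected tail of the series is negligible.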



All results of the preceding two sections are contained in formula
(4.11). In particular, we have for the probability of ruin

(4.12)    $q_z = \sum_{n=0}^{\infty} u_{z,n} = U_z(1).$

Now $1 - 4pq = (p - q)^2$, and from (4.8) we find that when p > q
we have $\lambda_1(1) = 1$ and $\lambda_2(1) = q/p$, while $\lambda_1(1) = q/p$ and $\lambda_2(1) = 1$
in the case q > p. Substituting into (4.11), we see that (4.12) reduces
to (2.4). Similarly we get the expected duration $D_z$ as given in (3.4)
by a simple differentiation. For p = q our expression becomes indeterminate,
but the formulas (2.5) and (3.5) follow by a passage to the
limit as s → 1, using L’Hospital’s rule. In the next section we shall
derive from (4.11) an explicit formula for $u_{z,n}$.

Our method applies also when a = ∞, which corresponds to the case
of a random walk with the single absorbing barrier at x = 0 (or playing
against an infinitely rich adversary). We have now the sole
boundary condition $U_0(s) = 1$. All solutions of (4.5) are of the form
(4.9), but since $\lambda_1(s) > 1$ and $\lambda_2(s) < 1$ for 0 < s < 1, we find that
$U_z(s)$ is unbounded unless A(s) = 0. Hence the required solution is

(4.13)    $U_z(s) = \lambda_2^z(s).$

This is the generating function of the probability that, starting from z > 0,
the particle will be absorbed at x = 0 exactly at the nth trial (in the absence
of other barriers). It is also the generating function of the first passage
time through x = 0 of a free particle starting at z > 0. In particular,
for z = 1 we find that $\lambda_2(s)$ is the generating function of the first
passage time through the neighboring position to the left. The first
passage time from z to 0 is the sum of the first passage times from z
to z − 1, from z − 1 to z − 2, etc., and is therefore the sum of z mutually
independent random variables each having the generating function
$\lambda_2(s)$. This explains why $U_z(s)$ is the zth power of a generating
function.

Substituting s = 1 into (4.13), we find the probability of ruin in
the case of an infinitely rich adversary. It is $(q/p)^z$ or 1, according
as q < p or q ≥ p.

If z < 0, the generating function for the first passage time through
the origin is obtained from $\lambda_2^{-z}(s)$ by interchanging p and q. An easy
computation shows that this generating function is $\lambda_1^z(s)$.

* 5. Explicit Expressions 

We shall now derive an explicit formula for $u_{z,n}$ by expanding $U_z(s)$
into partial fractions. Formally, the expression (4.11) for $U_z(s)$ depends
on a square root, but in reality $U_z(s)$ is a rational function. In
fact, expanding the expressions (4.8) according to the binomial theorem,
we see that the difference $\lambda_1^a(s) - \lambda_2^a(s)$ is a rational function in s
multiplied by $(1 - 4pqs^2)^{1/2}$; this root appears as a factor in both the
numerator and the denominator of (4.11), and hence $U_z(s)$ is the ratio
of two polynomials. The degree of the denominator is a − 1 or a − 2,
according to whether a is odd or even; the degree of the numerator is
a − 1 or a − 2, according to whether a − z is odd or even. In no case
can the degree of the numerator exceed the degree of the denominator
by more than one. Hence for n > 1 we can compute $u_{z,n}$ from formula
(7.8) of Chapter 11, provided only that all the roots of the denominator
are distinct.

* Starred sections treat special topics and may be omitted at first reading.

We could calculate the roots of the denominator and the corresponding
coefficients $\rho_\nu$ directly, but the algebra simplifies if we introduce
a new independent variable $\varphi$ by

(5.1)    $\frac{1}{\cos\varphi} = 2(pq)^{1/2}s.$

From (4.8) we find

(5.2)    $\lambda_{1,2}(s) = \left(\frac{q}{p}\right)^{1/2}(\cos\varphi \pm i\sin\varphi)$

and hence from (4.11)

(5.3)    $U_z(s) = \left(\frac{q}{p}\right)^{z/2}\frac{\sin(a - z)\varphi}{\sin a\varphi}.$

The roots of the denominator are obviously $\varphi = 0, \pi/a, 2\pi/a, \ldots$. The
corresponding values of s are

(5.4)    $s_\nu = \frac{1}{2(pq)^{1/2}\cos(\nu\pi/a)}.$

We get all possible values for $s_\nu$, putting ν = 0, 1, ..., a. However,
to ν = 0 and ν = a there correspond the extraneous values $\varphi = 0, \pi$,
which are also roots of the numerator in (5.3), and if a is even, no
number $s_\nu$ corresponds to ν = a/2. Hence, when a is odd, we get all
a − 1 roots $s_\nu$, putting ν = 1, 2, ..., a − 1; when a is even, the value
ν = a/2 must be omitted.

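We know that

(5.5)    $\left(\frac{q}{p}\right)^{z/2}\frac{\sin(a - z)\varphi}{\sin a\varphi} = \frac{\rho_1}{s_1 - s} + \cdots + \frac{\rho_{a-1}}{s_{a-1} - s}.$

To find $\rho_\nu$ multiply both sides by $s_\nu - s$ and let $s \to s_\nu$. We get (putting
$\varphi_\nu = \pi\nu/a$) as in Chapter 11, formula (7.5),

(5.6)    $\rho_\nu = -\left(\frac{q}{p}\right)^{z/2}\frac{\sin\,(a - z)\pi\nu/a}{a\cos\pi\nu \cdot (d\varphi/ds)_{s=s_\nu}} = \left(\frac{q}{p}\right)^{z/2}\frac{\sin(z\pi\nu/a)\cdot\sin(\pi\nu/a)}{2a(pq)^{1/2}\cos^2(\pi\nu/a)}.$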

Hence we get finally from (5.5) for the coefficient $u_{z,n}$ of $s^n$

(5.7)    $u_{z,n} = a^{-1}2^n p^{(n-z)/2}q^{(n+z)/2}\sum_{\nu=1}^{a-1}\cos^{n-1}\frac{\pi\nu}{a}\,\sin\frac{\pi\nu}{a}\,\sin\frac{\pi z\nu}{a}.$

(Strictly speaking, the term ν = a/2 should be omitted when a is even;
but it is zero anyway and therefore does no harm.)
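
Formula (5.7) is straightforward to evaluate. In the Python sketch below the name ruin_at_step and the sample arguments are assumptions for illustration; the restriction to n > 1 follows the statement in the text, and the cases printed are small enough to be checked by enumerating paths.

```python
import math

def ruin_at_step(z, a, p, n):
    """Probability u_{z,n} of ruin exactly at the nth trial, from (5.7)."""
    if n < 2:
        raise ValueError("formula (5.7) is stated for n > 1")
    q = 1.0 - p
    total = sum(
        math.cos(math.pi * v / a) ** (n - 1)
        * math.sin(math.pi * v / a)
        * math.sin(math.pi * z * v / a)
        for v in range(1, a)
    )
    return 2.0**n * p ** ((n - z) / 2) * q ** ((n + z) / 2) * total / a

# z = 1, a = 3, p = 1/2: ruin at trial 3 requires the path 1 -> 2 -> 1 -> 0.
print(ruin_at_step(1, 3, 0.5, 3))
# z = 2, a = 4, p = 0.4: ruin at trial 2 requires two successive losses.
print(ruin_at_step(2, 4, 0.4, 2))
```

The first value should equal 1/8 and the second $q^2 = 0.36$, up to rounding error.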

For n > 1 formula (5.7) represents the probability of ruin (absorption) 
at the nth trial. It goes back to Lagrange and has been derived in many 
different ways.⁷ Despite an honorable history and its availability in
textbooks, the formula is rediscovered at frequent intervals. For an 
alternative explicit expression see problem 6; for limiting forms cf. 
section 6 and problem 7. 

If we let a → ∞, the sum in (5.7) may be interpreted as a Riemann
sum approximating an integral. In this way we find that in a game
against an infinitely rich adversary (single absorbing barrier at x = 0)
the probability $u_{z,n}$ that a player with initial capital z > 0 will be ruined
exactly at the nth step is

(5.8)    $u_{z,n} = 2^n p^{(n-z)/2}q^{(n+z)/2}\int_0^1 \cos^{n-1}\pi x\,\sin\pi x\,\sin\pi xz\,dx.$


This integral can be expressed in an elementary way⁸ as follows

(5.9)    $u_{z,n} = \frac{z}{n}\binom{n}{(n - z)/2}p^{(n-z)/2}q^{(n+z)/2};$

the binomial coefficient is again to be interpreted as zero if (n − z)/2
is not an integer of the interval [0, n]. The corresponding generating
function was found to be $\lambda_2^z(s)$ (cf. end of section 4).

7 An elementary derivation using trigonometric interpolation was given by Ellis, 
Cambridge Mathematical Journal, vol. 4 (1844), or The Mathematical and Other 
Writings of R. E. Ellis, Cambridge and London, 1863. 

8 Integrating by parts and observing that
$\cos\pi xz = \cos\pi x(z - 1)\cos\pi x - \sin\pi x(z - 1)\sin\pi x$,
we get a recursion formula for $u_{z,n}$ which checks with (5.9).
A simpler proof consists in verifying that (5.9) is a solution of the difference equation
(4.1) with the appropriate boundary conditions (4.2)-(4.3) at z = 0.




6. Passage to the Limit; Diffusion Processes 

It has already been pointed out that our random-walk models serve 
as a first approximation to the theory of diffusion and Brownian 
motion, where small particles are exposed to a tremendous number of 
molecular shocks. Each shock has a negligible effect, but the superposition
of many small actions produces an observable motion. Accordingly,
we now want to study random walks where the individual
steps are extremely small and occur in very rapid succession. In the
limit the process will appear as a continuous motion. The point of
interest is that in passing to this limit our formulas remain meaningful
and agree with physically significant formulas of diffusion theory which
can be derived under much more general conditions by more streamlined
methods.⁹ This explains partly why the random-walk model,
despite its crudeness, describes diffusion processes reasonably well; 
only the limiting case is physically significant, and various discrete 
models lead to the same limiting formulas. The situation is in many 
ways analogous to the central limit theorem where we saw that under 
extremely general conditions the cumulative effect of many chance 
components is practically independent of the nature of the individual 
components. 

Let us begin with an unrestricted random walk starting at the origin,
and let $v_{x,n}$ be the probability that the nth step takes the particle to
the position x. If r among the n steps are directed to the right, n - r
are directed to the left, and the total displacement is r - (n - r)
= 2r - n units. Since this displacement is to equal x, we must have
2r - n = x. This is possible only if n and x are either both even or
both odd (which means that after an even number of steps the abscissa
x is an even integer). Out of n steps r can be selected in $\binom{n}{r}$ ways,
and therefore

(6.1)    $v_{x,n} = \binom{n}{(n+x)/2} p^{(n+x)/2} q^{(n-x)/2};$

here the binomial coefficient should be interpreted as 0 whenever
(n + x)/2 is not an integer in the interval [0, n].

9 The limiting formulas of the present section agree with those of the now classical
Einstein-Wiener theory. The newer, more refined theories (Uhlenbeck, Ornstein)
are not considered here. Credit for discovering the connection between random
walks and diffusion is due principally to L. Bachelier (1870- ). His work is frequently
of a heuristic nature, but he derived many new results. Kolmogorov's
theory of stochastic processes of the Markov type is based largely on Bachelier's
ideas. Cf. in particular L. Bachelier, Calcul des probabilités, Paris, 1912.





There is an alternative way of deriving (6.1) by using the argument
which led to the difference equation (4.1) and the boundary conditions
(4.2) and (4.3). One verifies that $v_{x,n}$ must satisfy the difference
equation

(6.2)    $v_{x,n+1} = p\,v_{x-1,n} + q\,v_{x+1,n}$

with the boundary conditions

(6.3)    $v_{0,0} = 1, \qquad v_{x,0} = 0$ for $x \ne 0.$

Given (6.3), we put in (6.2) successively n = 1, 2, ... and get first all
values $v_{x,1}$, and then successively $v_{x,2}$, $v_{x,3}$, .... This shows that the
conditions (6.2) and (6.3) uniquely determine $v_{x,n}$. On the other hand,
it is readily seen that (6.1) is a solution.
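The agreement between the closed formula (6.1) and the recursion (6.2)-(6.3) can be checked numerically. The sketch below is an illustration only; the function names are ours, and p denotes the probability of a step to the right.

```python
from math import comb


def v_closed(x, n, p):
    """v_{x,n} from (6.1); zero unless (n + x)/2 is an integer in [0, n]."""
    q = 1.0 - p
    if abs(x) > n or (n + x) % 2:
        return 0.0
    r = (n + x) // 2  # number of steps to the right
    return comb(n, r) * p ** r * q ** (n - r)


def v_by_recursion(n_max, p):
    """Rows v_{.,0}, v_{.,1}, ..., v_{.,n_max} built from (6.2) and (6.3)."""
    q = 1.0 - p
    v = {0: 1.0}  # boundary condition (6.3)
    rows = [v]
    for n in range(n_max):
        # difference equation (6.2): v_{x,n+1} = p v_{x-1,n} + q v_{x+1,n}
        v = {x: p * v.get(x - 1, 0.0) + q * v.get(x + 1, 0.0)
             for x in range(-(n + 1), n + 2)}
        rows.append(v)
    return rows


if __name__ == "__main__":
    rows = v_by_recursion(6, 0.3)
    for x in range(-6, 7, 2):
        print(x, rows[6][x], v_closed(x, 6, 0.3))
```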

Let us now change the unit of length so that each step has length $\Delta x$
and suppose that the time between any two consecutive steps is $\Delta t$. During
time t the particle performs about $t/\Delta t$ jumps, and a displacement x is
now equivalent to $x/\Delta x$ units. Only multiples of $\Delta x$ and $\Delta t$ represent
meaningful coordinates, but in the limit $\Delta x \to 0$, $\Delta t \to 0$ every displacement
and all times become possible.

We must not expect sensible results if we let $\Delta x$ and $\Delta t$ approach
zero in an arbitrary manner. It suffices to notice that the maximum
possible displacement in time t amounts to $t\,\Delta x/\Delta t$, so that in the limit
no motion exists if $\Delta x/\Delta t \to 0$. Physically speaking, we must keep the
x- and t-scales in an appropriate ratio or the process will degenerate
in the limit, the velocities tending to zero or infinity. To find the
proper ratio we note that the total displacement during time t is the
sum of about $t/\Delta t$ mutually independent random variables each having
the mean $(p - q)\Delta x$ and variance $\{1 - (p - q)^2\}(\Delta x)^2 = 4pq(\Delta x)^2$.
The mean and variance of the total displacement in time t are therefore
about $t(p - q)\Delta x/\Delta t$ and $4pqt(\Delta x)^2/\Delta t$, respectively. To obtain
reasonable results we must let $\Delta x$ and $\Delta t$ approach zero so that the
mean and variance remain finite for all t. The finiteness of the variance
requires that $(\Delta x)^2/\Delta t$ should remain bounded; the finiteness of the
mean implies that p - q must be of the order of magnitude of $\Delta x$.
This suggests putting

(6.4)    $\frac{(\Delta x)^2}{\Delta t} = 2D, \qquad p = \frac{1}{2} + \frac{c}{2D}\,\Delta x,$



where D and c are constants. The numerical value of D introduces
only a scale factor; for mathematical simplicity it would be best to
put D = 1, but we keep D unspecified in order to facilitate comparison
with physical theories. The constants D and c are, respectively, the
diffusion coefficient and the drift. If c = 0, the random walk is symmetric,
and, in general, the sign of c determines the direction of the
drift. In the limit both p and q approach 1/2; with any other norming
the particle would drift away so fast that the probability of finite
displacements would tend to zero.

We shall use the norming (6.4) to pass to the limit $\Delta x \to 0$, $\Delta t \to 0$.
The total displacement at time $t \approx n\Delta t$ is determined by n Bernoulli
trials, and therefore the limiting form of $v_{x,n}$ is known from Chapter 7
to be given by the normal distribution. The necessary computations
were effected there and need not be repeated. For a fixed $\Delta x$ the displacement
is the sum of finitely many independent variables, and its
mean is $t(p - q)\Delta x/\Delta t = 2ct$; its variance $4pqt(\Delta x)^2/\Delta t \sim 2Dt$. We
find therefore that the probability that at time t the displacement lies
between $x_0$ and $x_1$ ($x_0 < x_1$) tends to

(6.5)    $\frac{1}{\sqrt{2\pi}} \int_{y_0}^{y_1} e^{-y^2/2}\,dy,$

where $y_1 = (x_1 - 2ct)/(2Dt)^{1/2}$ and $y_0 = (x_0 - 2ct)/(2Dt)^{1/2}$. (According
to the central limit theorem, the same conclusion holds for more
general random walks.)
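The passage to the limit under the norming (6.4) can be illustrated numerically: for a small step length the exact binomial probability should already be close to the normal integral (6.5). The following Python sketch is an illustration under assumed values of D, c, t; the function names are ours, and the binomial terms are evaluated through logarithms so that a large number of steps causes no overflow.

```python
from math import lgamma, log, exp, erf, sqrt


def binom_pmf(n, r, p):
    """Binomial probability of r successes in n trials, computed via logarithms."""
    return exp(lgamma(n + 1) - lgamma(r + 1) - lgamma(n - r + 1)
               + r * log(p) + (n - r) * log(1.0 - p))


def exact_prob(x0, x1, t, D, c, dx):
    """P{x0 < displacement at time t < x1} for the walk normed as in (6.4)."""
    dt = dx * dx / (2.0 * D)        # (dx)^2 / dt = 2D
    p = 0.5 + c * dx / (2.0 * D)    # p = 1/2 + (c / 2D) dx
    n = round(t / dt)               # number of steps performed up to time t
    return sum(binom_pmf(n, r, p)
               for r in range(n + 1) if x0 < (2 * r - n) * dx < x1)


def normal_limit(x0, x1, t, D, c):
    """The limit (6.5), expressed through the error function."""
    def Phi(y):
        return 0.5 * (1.0 + erf(y / sqrt(2.0)))
    s = sqrt(2.0 * D * t)
    return Phi((x1 - 2.0 * c * t) / s) - Phi((x0 - 2.0 * c * t) / s)


if __name__ == "__main__":
    for dx in (0.2, 0.1, 0.05, 0.02):
        print(dx, exact_prob(-0.5, 1.5, 1.0, 1.0, 0.3, dx),
              normal_limit(-0.5, 1.5, 1.0, 1.0, 0.3))
```

The first printed column depends on the step length, the second does not; the two should approach each other as the step length decreases.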

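As for equation (6.2), we pass to the usual functional notation and
write it in the form $v(x, t + \Delta t) = p\,v(x - \Delta x, t) + q\,v(x + \Delta x, t)$.
Expanding according to Taylor's theorem up to terms of second order,
we get formally

(6.6)    $\Delta t \cdot \frac{\partial v(x,t)}{\partial t} = (q - p)\Delta x \cdot \frac{\partial v(x,t)}{\partial x} + \frac{(\Delta x)^2}{2} \frac{\partial^2 v(x,t)}{\partial x^2} + \cdots.$

Using (6.4), we get in the limit

(6.7)    $\frac{\partial v(x,t)}{\partial t} = -2c \cdot \frac{\partial v(x,t)}{\partial x} + D \cdot \frac{\partial^2 v(x,t)}{\partial x^2}.$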

This is the well-known Fokker-Planck equation for diffusion with 
drift, which can be derived from more general and more convincing 
assumptions. In the usual theory, the solution (6.5) is derived from 
(6.7), while we have obtained both results by the same limiting process. 
Our procedure is only heuristic but can be justified more rigorously. 
The fact is that all formulas of the discrete random walk permit a 
similar passage to the limit. 
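That the normal density belonging to (6.5) satisfies (6.7) can be verified by differentiation; a quick numerical confirmation is sketched below. The sketch assumes that the density in question is the normal density with mean 2ct and variance 2Dt, and it replaces the derivatives by central difference quotients with a step h chosen for illustration.

```python
from math import exp, pi, sqrt


def density(x, t, D, c):
    """Normal density with mean 2ct and variance 2Dt, i.e. the integrand of (6.5)
    written in the original variable x."""
    return exp(-(x - 2.0 * c * t) ** 2 / (4.0 * D * t)) / sqrt(4.0 * pi * D * t)


def residual(x, t, D, c, h=1e-3):
    """Left side minus right side of (6.7), with derivatives replaced by
    central difference quotients of step h."""
    v_t = (density(x, t + h, D, c) - density(x, t - h, D, c)) / (2.0 * h)
    v_x = (density(x + h, t, D, c) - density(x - h, t, D, c)) / (2.0 * h)
    v_xx = (density(x + h, t, D, c) - 2.0 * density(x, t, D, c)
            + density(x - h, t, D, c)) / (h * h)
    return v_t - (-2.0 * c * v_x + D * v_xx)


if __name__ == "__main__":
    for x in (-1.0, 0.0, 0.7, 2.0):
        print(x, residual(x, 1.3, 0.8, 0.25))
```

The printed residuals should be small compared with the density itself, and they shrink further when h is reduced moderately.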

As a further example, consider the limiting form of the probabilities
for the first passage. For simplicity let us first consider formula (5.9)
which corresponds to a single barrier. Of the two quantities $w_{z,n}$
and $w_{z,n+1}$, one is necessarily zero. The sum $w_{z,n} + w_{z,n+1}$ represents,
asymptotically, the probability of absorption during the time interval
$(t, t + 2\Delta t)$. We shall show that $w_{z,n} + w_{z,n+1} \sim f(z, t)(2\Delta t)$, where
f(z, t) is a continuous function. Then the limiting probability of
absorption within any time interval $(t_1, t_2)$ is the integral of f(z, t)
extended over that interval. Suppose now that n - z is even. Then
$w_{z,n+1} = 0$, and to find f(z, t) we must replace z in (5.9) by $z/\Delta x$ and n
by $t/\Delta t$, and apply (6.4). Using the normal approximation to the
binomial distribution and the last equation (6.9), we find easily

(6.8)    $f(z, t) \sim \frac{z}{2(\pi D t^3)^{1/2}}\, e^{-(z+2ct)^2/(4Dt)}.$

This is the limiting form of (5.9); again it coincides with the corresponding
formula of diffusion theory. In fact, it is easily verified that
f(-x, t) is a solution of (6.7). (In the definition of $w_{z,n}$ the variable
z plays the role of -x in $v_{x,n}$.)
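The limiting relation between (5.9) and (6.8) can also be observed numerically. In the sketch below, which is an illustration with names of our own choosing, z and n in (5.9) are replaced by z/Δx and t/Δt under the norming (6.4), and the result divided by 2Δt is compared with the continuous expression, here taken as z (4πDt³)^(-1/2) exp{-(z+2ct)²/(4Dt)}.

```python
from math import lgamma, log, exp, pi, sqrt


def w_discrete(z_steps, n, p):
    """w_{z,n} of (5.9), evaluated through logarithms to allow large n."""
    if n < z_steps or (n - z_steps) % 2:
        return 0.0
    k = (n - z_steps) // 2  # number of steps away from the barrier
    log_w = (log(z_steps) - log(n)
             + lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
             + k * log(p) + (n - k) * log(1.0 - p))
    return exp(log_w)


def f_limit(z, t, D, c):
    """Continuous first-passage density of (6.8)."""
    return z / (2.0 * sqrt(pi * D * t ** 3)) * exp(-(z + 2.0 * c * t) ** 2 / (4.0 * D * t))


def f_from_walk(z, t, D, c, dx):
    """(w_{z,n} + w_{z,n+1}) / (2 dt) for the walk normed as in (6.4)."""
    dt = dx * dx / (2.0 * D)
    p = 0.5 + c * dx / (2.0 * D)
    z_steps = round(z / dx)
    n = round(t / dt)
    if (n - z_steps) % 2:
        n += 1  # of w_{z,n} and w_{z,n+1} only the one with n - z even is non-zero
    return w_discrete(z_steps, n, p) / (2.0 * dt)


if __name__ == "__main__":
    for dx in (0.2, 0.1, 0.05, 0.02):
        print(dx, f_from_walk(1.0, 0.8, 1.0, 0.2, dx), f_limit(1.0, 0.8, 1.0, 0.2))
```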

A similar argument applies to (5.7). An inspection of this formula
shows that the contributions of $\nu = k$ and $\nu = a - k$ cancel if n - z
is odd and add if n - z is even. Hence we get the limiting form of
$f(z, t) \sim (u_{z,n} + u_{z,n+1})/(2\Delta t)$ by extending in (5.7) the sum twice
over $1 \le \nu < a/2$. Replace z, a, n respectively by $z/\Delta x$, $a/\Delta x$, $t/\Delta t$
and observe that for fixed $\nu$

(6.9)    $\sin \frac{\pi\nu\Delta x}{a} \sim \frac{\pi\nu\Delta x}{a},$

         $\left(\cos \frac{\pi\nu\Delta x}{a}\right)^{t/\Delta t} \sim \left(1 - \frac{D\pi^2\nu^2\Delta t}{a^2}\right)^{t/\Delta t} \sim e^{-D\pi^2\nu^2 t/a^2},$

         $(4pq)^{t/(2\Delta t)} \left(\frac{q}{p}\right)^{z/(2\Delta x)} \sim e^{-c(ct+z)/D}.$

We obtain formally the limiting form

(6.10)   $f(z, t) \sim 2\pi D a^{-2} e^{-c(ct+z)/D} \sum_{\nu=1}^{\infty} \nu\, e^{-D\pi^2\nu^2 t/a^2} \sin \frac{\pi z \nu}{a}.$

The formal passage to the limit is justified because of uniform convergence:
the contribution of the terms with large $\nu$ is negligible both
in (6.10) and in the original sum (5.7) (where we have $\nu < a/2$).

In diffusion theory (6.10) is known as Fürth's formula for first
passages and is derived directly from the Fokker-Planck equation.
In free diffusion the integral over (6.10), extended over the time interval
$(t_1, t_2)$, gives the probability that a particle starting at z > 0 will within
that time interval for the first time reach the origin and not have
previously passed the barrier x = a.

* 7. Random Walks in the Plane and Space 

In a two-dimensional random walk the particle moves in unit steps
in one of the four directions parallel to the x- and y-axes. If the particle
starts at the origin, the possible positions are all points of the plane
with integral-valued coordinates. Each position has four neighbors.
Similarly, in three dimensions each position has six neighbors. In
order to define the random walk the corresponding four or six probabilities
must be specified. For simplicity we shall consider only the
symmetric case where all directions have the same probability. The
complexity of problems is considerably greater than in one dimension,
for now the domains to which the particle is restricted may have arbitrary
shapes so that complicated boundaries take the place of the
single-point barriers in the one-dimensional case.

We begin with an interesting theorem due to Polya. 10 

Theorem. In the symmetric random walks in one and two dimensions
there is probability one that the particle will sooner or later (and therefore
infinitely often) return to its initial position. In three dimensions, however,
this probability is only about 0.35 [the expected number of returns
is then $0.65 \sum k(0.35)^k = 0.35/0.65 \approx 0.53$].

Before proving the theorem let us give two alternative formulations,
both due to Polya. First, it is almost obvious that the theorem implies
that in one and two dimensions there is probability 1 that the particle
will pass infinitely often through every possible point; in three dimensions
this is not true, however. Thus the statement “all roads lead to
Rome” is, in a way, justified in two dimensions.

Alternatively, consider two particles performing independent symmetric
random walks, the steps occurring simultaneously. Will they
ever meet? To simplify language let us define the distance of two
possible positions as the smallest number of steps leading from one
position to the other. (Then distance = sum of absolute differences of
the coordinates.) If the two particles move one step each, their mutual
distance either remains the same or changes by two units. Accordingly,
the distance of our two particles either is even at all times or else is
always odd. In the second case the two particles can never occupy
the same position. In the first case it is readily seen that the probability
of the two particles meeting at the nth step equals the probability of
the first particle to reach in 2n steps the initial position of the second
particle. Hence our theorem states that in two, but not in three,
dimensions the two particles are sure infinitely often to occupy the
same position. If the initial distance of the two particles is odd, a
similar argument shows that they will infinitely often occupy neighboring
positions. If this is called meeting, then our theorem asserts that
in one and two dimensions the two particles are certain to meet infinitely
often, while in three dimensions there is a positive probability that they
never meet.

* Starred sections treat special topics and may be omitted at first reading.

10 G. Polya, Über eine Aufgabe der Wahrscheinlichkeitsrechnung betreffend die
Irrfahrt im Strassennetz, Mathematische Annalen, vol. 84 (1921), pp. 149-160. The
numerical value 0.35 was calculated by W. H. McCrea and F. J. W. Whipple,
Random Paths in Two and Three Dimensions, Proceedings of the Royal Society of
Edinburgh, vol. 60 (1940), pp. 281-298.



Proof. For one dimension the theorem has been proved in Chapter
12, except that there we referred to a coin-tossing game rather than
to a symmetric random walk. The proof for two and three dimensions
proceeds along the same lines. Let $u_n$ be the probability that the nth
trial takes the particle to the initial position. According to theorem 2
of Chapter 12, section 3, we have to prove that in the case of two dimensions
$\sum u_n$ diverges, while in the case of three dimensions $\sum u_n \approx 0.53$.
In two dimensions a return to the initial position is possible only if the
numbers of steps in the positive x- and y-directions equal those in the
negative x- and y-directions, respectively. Hence $u_n = 0$ if n is odd
while (using the multinomial distribution of Chapter 6, section 7)

(7.1)    $u_{2n} = \frac{1}{4^{2n}} \sum_{k=0}^{n} \frac{(2n)!}{k!\,k!\,(n-k)!\,(n-k)!} = \frac{1}{4^{2n}} \binom{2n}{n} \sum_{k=0}^{n} \binom{n}{k}^2.$

The last expression equals $4^{-2n} \binom{2n}{n}^2$ by Chapter 2, formula (9.8).
Stirling's formula shows that $u_{2n}$ is of the order of magnitude 1/n, so
that $\sum u_{2n}$ diverges as asserted.

In the case of three dimensions we find similarly

(7.2)    $u_{2n} = \frac{1}{6^{2n}} \sum_{j,k} \frac{(2n)!}{j!\,j!\,k!\,k!\,(n-j-k)!\,(n-j-k)!},$

the summation extending over all j, k with $j + k \le n$. It is easily
verified that

(7.3)    $u_{2n} = \frac{1}{2^{2n}} \binom{2n}{n} \sum_{j,k} \left\{ \frac{1}{3^n} \frac{n!}{j!\,k!\,(n-j-k)!} \right\}^2.$





Within the braces we have the terms of a trinomial distribution, and
we know that they add to unity. Hence the sum of the squares is
smaller than the maximum term within braces, and this is attained
when both j and k are close to n/3. Stirling's formula shows that this
maximum is of the order of magnitude $n^{-1}$, and therefore $u_{2n}$ is of the
magnitude $n^{-3/2}$, so that $\sum u_{2n}$ converges as asserted.
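The different behavior in two and three dimensions shows up clearly in the partial sums of the series. The following Python sketch, an illustration with function names of our own, evaluates $u_{2n}$ from the closed form following (7.1) and from (7.3) in exact integer arithmetic before converting to floating point.

```python
from math import comb, factorial


def u2n_plane(n):
    """u_{2n} in two dimensions: 4^{-2n} C(2n, n)^2, the closed form of (7.1)."""
    return (comb(2 * n, n) / 4 ** n) ** 2


def u2n_space(n):
    """u_{2n} in three dimensions, evaluated from (7.3)."""
    total = 0
    for j in range(n + 1):
        for k in range(n - j + 1):
            # trinomial coefficient n! / (j! k! (n-j-k)!)
            t = factorial(n) // (factorial(j) * factorial(k) * factorial(n - j - k))
            total += t * t
    return comb(2 * n, n) * total / (4 ** n * 9 ** n)


def partial_sum(term, n_max):
    """u_2 + u_4 + ... + u_{2 n_max} for the given term function."""
    return sum(term(n) for n in range(1, n_max + 1))


if __name__ == "__main__":
    for n_max in (10, 20, 40):
        print(n_max, partial_sum(u2n_plane, n_max), partial_sum(u2n_space, n_max))
```

In the plane the partial sums keep growing, roughly like the logarithm of the number of terms, while in space they increase ever more slowly toward a finite limit; since the terms decrease only like $n^{-3/2}$, the convergence is slow.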

Polya’s theorem is analogous to the facts concerning multiple coin 
tossings discussed in Chapter 12, example (3.e). 

We conclude this section with another problem which generalizes
the concept of absorbing barriers. To fix ideas we consider the case
of two dimensions where instead of the interval $0 \le x \le a$ we have a
plane domain D, that is, a collection of points with integral-valued
coordinates. Each point has four neighbors, but for some points of D
one or more of the neighbors lie outside D. Such points form the
boundary of D, while all other points are called interior points. In
the one-dimensional case the two barriers form the boundary, and our
problem consisted in finding the probability that, starting from z,
the particle will reach the boundary point x = 0 before reaching x = a.
By analogy, we now ask for the probability that the particle will
reach a certain section of the boundary before reaching any boundary
point which is not in this section. This means that we divide all
boundary points into two sets B' and B". If (x, y) is an interior
point, we ask for the probability u(x, y) that, starting from (x, y), the
particle will reach a point of B' before reaching a point of B". In
particular, if B' consists of a single point, then u(x, y) is the probability
that the particle will, sooner or later, be absorbed at that particular
point.

Let (x, y) be an interior point. The first step takes the particle
from (x, y) to one of the four neighbors $(x \pm 1, y)$, $(x, y \pm 1)$, and if
all four of them are interior points, we must have

(7.4)    $u(x, y) = \tfrac{1}{4}\{u(x + 1, y) + u(x - 1, y) + u(x, y + 1) + u(x, y - 1)\}.$

This is a partial difference equation which takes the place of (2.1)
(with p = q = 1/2). If (x + 1, y) is a boundary point, then its contribution
u(x + 1, y) must be replaced by 1 or 0, according to whether
(x + 1, y) belongs to B' or B". Hence (7.4) will be valid for all interior
points if we agree that for a boundary point $(\xi, \eta)$ we put $u(\xi, \eta) = 1$ if
$(\xi, \eta)$ is in B' and $u(\xi, \eta) = 0$ if $(\xi, \eta)$ is in B". This convention takes
the place of the boundary conditions (2.2).





In (7.4) we now have a system of linear equations for the unknowns
u(x, y); to each interior point there correspond one unknown and one
equation. The system is non-homogeneous, since in it there appears at
least one boundary point $(\xi, \eta)$ of B' and it gives rise to a contribution 1/4
on the right side. If the domain D is finite, we have as many equations
as unknowns, and it is well known that the system has a unique solution
if and only if the corresponding homogeneous system [with
$u(\xi, \eta) = 0$ for all boundary points] has no non-vanishing solution.
Now u(x, y) is the mean of the four neighboring values $u(x \pm 1, y)$,
$u(x, y \pm 1)$ and hence cannot exceed all four. In other words, u(x, y)
cannot have either a maximum or a minimum in the strict sense, so
that the greatest and the smallest value occur at boundary points.
Hence, if all boundary values vanish, so does u(x, y) at all interior
points. This proves the existence and uniqueness of the solution of
(7.4). Since the boundary values are 0 and 1, all values u(x, y) lie between
0 and 1, as is required for probabilities. These statements are
true also for the case of infinite domains, as will be seen from a general
theorem on infinite Markov chains. 11
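For a small finite domain the system (7.4) is easily solved numerically. The sketch below is an illustration under assumptions of our own: the domain is taken to be a square with corners (0, 0) and (m, m), the set B' is specified by a user-supplied function, and the equations are solved by repeated averaging (Gauss-Seidel sweeps) rather than by elimination.

```python
def solve_absorption(m, in_b_prime, tol=1e-12, max_sweeps=100000):
    """Solve (7.4) on the square 0 <= x, y <= m.

    Points with x or y equal to 0 or m form the boundary; in_b_prime(x, y)
    tells whether a boundary point belongs to B' (value 1) or B'' (value 0).
    Returns a dict mapping (x, y) to u(x, y).
    """
    u = {}
    for x in range(m + 1):
        for y in range(m + 1):
            on_boundary = x in (0, m) or y in (0, m)
            u[(x, y)] = 1.0 if (on_boundary and in_b_prime(x, y)) else 0.0
    interior = [(x, y) for x in range(1, m) for y in range(1, m)]
    for _ in range(max_sweeps):
        change = 0.0
        for (x, y) in interior:
            # each interior value is the mean of its four neighbors
            new = 0.25 * (u[(x + 1, y)] + u[(x - 1, y)]
                          + u[(x, y + 1)] + u[(x, y - 1)])
            change = max(change, abs(new - u[(x, y)]))
            u[(x, y)] = new
        if change < tol:
            break
    return u


if __name__ == "__main__":
    # B' = left side of a 6 x 6 square
    u = solve_absorption(6, lambda x, y: x == 0)
    for y in range(5, 0, -1):
        print(" ".join(f"{u[(x, y)]:.3f}" for x in range(1, 6)))
```

With B' taken as one side of the square, symmetry requires the value 1/4 at the center, which gives a simple check on the computation.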

8. The Generalized One-dimensional Random Walk (Sequential Sampling)

We now return to one dimension but consider the general case where
the particle does not necessarily pass to a neighboring point. The
possible positions are still $x = 0, \pm 1, \pm 2, \ldots$. However, we shall
assume that at each step the particle has probability $p_k$ to move from x to
the point x + k; here the integer k is allowed to be zero, positive, or
negative (in the ordinary random walk $p_1 = p$, $p_{-1} = q$). We shall
consider the following ruin problem. The particle starts from a position z
with 0 < z < a; required is the probability $u_z$ that the particle will arrive
at some position $x \le 0$ before reaching any position $x \ge a$. An interpretation
of this problem in the terminology of gambling was discussed
at the end of section 1, where it was also stated that our problem is
of great importance in Wald's sequential analysis. There the term
“rejection” is used instead of ruin. Also, in Wald's terminology the
particle always starts from x = 0, and the game (i.e., sampling)
terminates when it reaches a position to the left of x = -b or to the
right of x = a; this, however, is only a notational difference. 12

11 Explicit solutions are known in only a few cases and are always very complicated.
Solutions for the case of rectangular domains, infinite strips, etc., will be
found in the paper by McCrea and Whipple cited in footnote 10.

12 Wald treats also the case of continuous variables and uses different tools.
The present methods apply also to general random variables.




Without loss of generality we shall suppose that steps are possible
in both the positive and negative directions. Otherwise we would
have either $u_z = 0$ or $u_z = 1$ for all z.

The probability of ruin at the first step is obviously

(8.1)    $r_z = p_{-z} + p_{-z-1} + p_{-z-2} + \cdots$

(a quantity which may be zero). After the first step the random walk
continues only if the particle moved to a position x with 0 < x < a.
The probability of a jump from z to x is $p_{x-z}$, and the probability of
subsequent ruin is then $u_x$. Therefore

(8.2)    $u_z = \sum_{x=1}^{a-1} u_x p_{x-z} + r_z.$

Once more we have here a - 1 linear equations for a - 1 unknowns
$u_z$. The system is non-homogeneous, since at least for z = 1 the
probability $r_1$ is different from zero (steps in the negative direction
being possible, which obviously implies $r_1 > 0$). We claim that the
corresponding homogeneous system

(8.3)    $u_z = \sum_{x=1}^{a-1} u_x p_{x-z}$

can have only the solution $u_x = 0$.

In fact, if it had another solution, one of the values $u_z$ would be
largest in absolute value, say $u_z = M > 0$. Suppose first that $p_{-1} \ne 0$.
Since the coefficients $p_{x-z}$ in (8.3) add to at most unity, the equation
is possible only if all those $u_x$ which actually appear on the right
side (with a coefficient different from zero) equal M, and if their
coefficients add to 1. Hence $u_{z-1} = M$, and, arguing the same way,
$u_{z-2} = u_{z-3} = \cdots = u_1 = M$. However, for z = 1 the coefficients
$p_{x-z}$ in (8.3) add to less than unity, so that M must be zero. The
same argument obviously applies also if $p_{-1} = 0$, since we can replace
$p_{-1}$ by some other coefficient $p_k$ with k < 0 which is positive.

It follows that (8.2) has a unique solution, and thus our problem is
determined. Equation (8.2) plays the role of the difference equation
(2.1). Again we can simplify the writing by introducing the boundary
conditions

(8.4)    $u_x = 1$ if $x \le 0$,
         $u_x = 0$ if $x \ge a$.

Then (8.2) can be written in the form

(8.5)    $u_z = \sum_x u_x p_{x-z},$

the summation now extending over all x [for $x \ge a$ we have no contribution
owing to the second condition (8.4); the contributions for
$x \le 0$ add to $r_z$ owing to the first condition].
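For small a the a - 1 equations (8.2) can be solved directly. The following Python sketch does so in exact rational arithmetic by Gauss-Jordan elimination; it is an illustration only, and the function name and the representation of the distribution as a dictionary mapping k to $p_k$ are our own choices.

```python
from fractions import Fraction


def ruin_probabilities(p, a):
    """Solve the system (8.2) exactly.

    p maps each possible jump k to its probability p_k; the result is the
    list [u_1, ..., u_{a-1}] of ruin probabilities.
    """
    n = a - 1
    rows = []
    for z in range(1, a):
        # coefficients of u_x in  u_z - sum_x u_x p_{x-z} = r_z
        row = [Fraction(int(x == z)) - Fraction(p.get(x - z, 0))
               for x in range(1, a)]
        # r_z of (8.1): probability of a jump from z to some position <= 0
        r_z = sum((Fraction(pk) for k, pk in p.items() if k <= -z), Fraction(0))
        rows.append(row + [r_z])
    for col in range(n):  # Gauss-Jordan elimination
        piv = next(i for i in range(col, n) if rows[i][col] != 0)
        rows[col], rows[piv] = rows[piv], rows[col]
        pivot = rows[col][col]
        rows[col] = [v / pivot for v in rows[col]]
        for i in range(n):
            if i != col and rows[i][col] != 0:
                f = rows[i][col]
                rows[i] = [vi - f * vc for vi, vc in zip(rows[i], rows[col])]
    return [rows[i][n] for i in range(n)]


if __name__ == "__main__":
    quarter = Fraction(1, 4)
    print(ruin_probabilities({-2: quarter, -1: quarter, 1: quarter, 2: quarter}, 5))
```

For the ordinary symmetric random walk the routine should reproduce the familiar values 1 - z/a.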

For large a it is cumbersome to solve a - 1 linear equations directly,
and it is preferable to use the method of particular solutions analogous
to the procedure of section 2. It works whenever the probability
distribution $\{p_k\}$ has relatively few positive terms. Suppose that only
the $p_k$ with $-\nu \le k \le \mu$ are different from zero, so that the largest
possible jumps in the positive and negative directions are $\mu$ and $\nu$,
respectively. Consider the characteristic equation

(8.6)    $\sum p_k s^k = 1.$

It is equivalent to an algebraic equation of degree $\nu + \mu$. If s is a root
of (8.6), then $u_z = s^z$ is a formal solution of (8.5) for all z, but this
solution does not satisfy the boundary conditions (8.4). If (8.6) has
$\mu + \nu$ distinct roots $s_1, s_2, \ldots$, then the linear combination

(8.7)    $u_z = \sum A_k s_k^z$

is again a formal solution of (8.5) for all z. We have to adjust the
constants $A_k$ so that the boundary conditions are satisfied. Now for
0 < z < a only values x with $-\nu + 1 \le x \le a + \mu - 1$ appear in
(8.5). It suffices therefore to satisfy the boundary conditions (8.4)
for $x = 0, -1, -2, \ldots, -\nu + 1$, and $x = a, a + 1, \ldots, a + \mu - 1$,
so that we have $\mu + \nu$ conditions in all. If $s_k$ is a double root of (8.6),
we lose one constant, but in this case it is easily seen that $u_z = z s_k^z$ is
another formal solution. In every case the $\mu + \nu$ boundary conditions
determine the $\mu + \nu$ arbitrary constants.

Example. Suppose that each individual step takes the particle to
one of the four nearest positions, and we let $p_{-2} = p_{-1} = p_1 = p_2
= 1/4$. The characteristic equation (8.6) is $s^{-2} + s^{-1} + s + s^2 = 4$.
To solve it we put $t = s + s^{-1}$: with this substitution our equation
becomes $t^2 + t = 6$, which has the roots t = 2, -3. Solving
$t = s + s^{-1}$ for s, we find the four roots

(8.8)    $s_1 = s_2 = 1, \qquad s_3 = \frac{-3 + 5^{1/2}}{2} = s_4^{-1}, \qquad s_4 = \frac{-3 - 5^{1/2}}{2} = s_3^{-1}.$

Since $s_1$ is a double root, the general solution of (8.5) in our case is

(8.9)    $u_z = A_1 + A_2 z + A_3 s_3^z + A_4 s_4^z.$

The boundary conditions are $u_0 = u_{-1} = 1$, and $u_a = u_{a+1} = 0$.



They lead to four linear equations for the coefficients $A_j$ and to the
final solution

(8.10)   $u_z = 1 - \frac{z}{a} + \frac{(2z - a)(s_3^a - 1) - a(s_3^z - s_3^{a-z})}{a\{(a + 2)(s_3^a - 1) - a(s_3^{a+1} - s_4)\}}$

with $s_3$ and $s_4$ given by (8.8).
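The example lends itself to a numerical check. In the Python sketch below (an illustration; all names are ours) the four constants of (8.9) are determined by solving the four boundary equations in floating point, and the resulting values are compared with those obtained by iterating the difference equation (8.5) itself with the boundary values (8.4).

```python
from math import sqrt

S3 = (-3.0 + sqrt(5.0)) / 2.0  # roots of (8.8)
S4 = (-3.0 - sqrt(5.0)) / 2.0


def solve_linear(matrix, rhs):
    """Gauss-Jordan elimination with partial pivoting for a small system."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for c in range(n):
        piv = max(range(c, n), key=lambda i: abs(aug[i][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for i in range(n):
            if i != c:
                f = aug[i][c] / aug[c][c]
                aug[i] = [vi - f * vc for vi, vc in zip(aug[i], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def closed_form(a):
    """Return the function z -> u_z of the form (8.9) satisfying
    u_0 = u_{-1} = 1 and u_a = u_{a+1} = 0."""
    basis = [lambda z: 1.0, lambda z: float(z),
             lambda z: S3 ** z, lambda z: S4 ** z]
    points, values = [0, -1, a, a + 1], [1.0, 1.0, 0.0, 0.0]
    coeff = solve_linear([[f(z) for f in basis] for z in points], values)
    return lambda z: sum(c * f(z) for c, f in zip(coeff, basis))


def by_iteration(a, sweeps=2000):
    """Iterate u_z = (u_{z-2} + u_{z-1} + u_{z+1} + u_{z+2}) / 4 for 0 < z < a."""
    u = {x: (1.0 if x <= 0 else 0.0) for x in range(-1, a + 2)}
    for _ in range(sweeps):
        for z in range(1, a):
            u[z] = 0.25 * (u[z - 2] + u[z - 1] + u[z + 1] + u[z + 2])
    return u


if __name__ == "__main__":
    a = 6
    f, u = closed_form(a), by_iteration(a)
    for z in range(1, a):
        print(z, f(z), u[z])
```

For a = 3 the two equations can also be solved by hand and give $u_1 = 3/5$, $u_2 = 2/5$.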


Numerical Approximations. If the degree $\mu + \nu$ of the characteristic equation
(8.6) is not very small, then it is cumbersome to find all its roots. In practice
rather satisfactory approximations can be obtained in a surprisingly simple way.
Consider first the case where the probability distribution $\{p_k\}$ has mean zero.
Then the characteristic equation (8.6) has a double root at s = 1, and hence A + Bz
is a formal solution of (8.5). Of course, the two constants A and B do not suffice
to satisfy the $\mu + \nu$ boundary conditions (8.4). However, if we determine A and B
so that A + Bz vanishes for $z = a + \mu - 1$ and equals 1 for z = 0, then we shall have
$A + Bx \ge 1$ for $x \le 0$ and $A + Bx \ge 0$ for $a \le x < a + \mu$. Our A + Bz then
satisfies the boundary conditions (8.4) with the equality sign replaced by “greater
than or equal to.” The difference $A + Bz - u_z$ is therefore a formal solution of
(8.5) with non-negative boundary values whence $A + Bz - u_z \ge 0$. In like manner
we can get a lower bound for $u_z$ by determining A and B so that A + Bz vanishes
for z = a and equals 1 for $z = -\nu + 1$. Hence we have

(8.11)   $\frac{a - z}{a + \nu - 1} \le u_z \le \frac{a + \mu - z - 1}{a + \mu - 1}.$

If a is large as compared to $\mu + \nu$, we have here an excellent estimate. (Of course,
$u_z \approx (1 - z/a)$ is a better approximation, but does not give precise bounds.)

Next, consider the general case where the mean of the distribution $\{p_k\}$ is not
zero. The characteristic equation (8.6) has then a simple root at s = 1. The left
side of (8.6) approaches $\infty$ as $s \to 0$ and as $s \to \infty$. It is continuous for s > 0,
and its second derivative is positive; this means that for positive s the curve
$y = \sum p_k s^k$ is continuous and convex. Since it intersects the line y = 1 at s = 1,
there exists exactly one more intersection. Therefore, the characteristic equation
(8.6) has exactly two positive roots, 1 and $s_1$. As before, we see that $A + Bs_1^z$
is a formal solution of (8.5), and we can apply our previous argument, substituting
this solution for A + Bz. We find in this case

(8.12)   $\frac{s_1^a - s_1^z}{s_1^a - s_1^{-\nu+1}} \le u_z \le \frac{s_1^{a+\mu-1} - s_1^z}{s_1^{a+\mu-1} - 1}.$

Hence the

Theorem. The solution of our ruin problem satisfies the inequalities (8.11) if $\{p_k\}$
has zero mean, and (8.12) otherwise. Here $s_1$ is the unique positive root different from
1 of (8.6), and $\mu$ and $-\nu$ are defined, respectively, as the largest and smallest subscript
for which $p_k \ne 0$.
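The quality of the bounds can be examined numerically. The Python sketch below is an illustration with names of our own: it finds the second positive root of (8.6) by bisection, evaluates the bounds (8.11) or (8.12), and compares them with ruin probabilities obtained by iterating (8.5). The step distribution in the usage lines is an arbitrary example with negative mean.

```python
def other_root(p):
    """Positive root s_1 != 1 of (8.6), assuming the mean m is not zero."""
    def f(s):
        return sum(pk * s ** k for k, pk in p.items()) - 1.0
    m = sum(k * pk for k, pk in p.items())
    side = -1.0 if m > 0 else 1.0   # s_1 < 1 if m > 0, s_1 > 1 if m < 0
    h = 0.5
    while f(1.0 + side * h) >= 0:   # move toward 1 until the curve is below y = 1
        h /= 2.0
    inner = 1.0 + side * h
    outer = inner
    while f(outer) <= 0:            # move away from 1 until the curve is above y = 1
        outer = outer / 2.0 if m > 0 else outer * 2.0
    for _ in range(200):            # bisection between inner (f < 0) and outer (f > 0)
        mid = 0.5 * (inner + outer)
        if f(mid) < 0:
            inner = mid
        else:
            outer = mid
    return 0.5 * (inner + outer)


def bounds(p, a):
    """List of (lower, upper) bounds for u_1, ..., u_{a-1} from (8.11) or (8.12)."""
    nu, mu = -min(p), max(p)
    m = sum(k * pk for k, pk in p.items())
    if abs(m) < 1e-12:
        return [((a - z) / (a + nu - 1), (a + mu - z - 1) / (a + mu - 1))
                for z in range(1, a)]
    s = other_root(p)
    top = s ** (a + mu - 1)
    return [((s ** a - s ** z) / (s ** a - s ** (1 - nu)),
             (top - s ** z) / (top - 1.0)) for z in range(1, a)]


def ruin_by_iteration(p, a, sweeps=5000):
    """Approximate u_1, ..., u_{a-1} by iterating (8.5) with boundary values (8.4)."""
    nu, mu = -min(p), max(p)
    u = {x: (1.0 if x <= 0 else 0.0) for x in range(1 - nu, a + mu)}
    for _ in range(sweeps):
        for z in range(1, a):
            u[z] = sum(pk * u[z + k] for k, pk in p.items())
    return [u[z] for z in range(1, a)]


if __name__ == "__main__":
    p = {-2: 0.3, -1: 0.2, 1: 0.3, 2: 0.2}
    for z, (uz, (lo, hi)) in enumerate(zip(ruin_by_iteration(p, 10), bounds(p, 10)), 1):
        print(z, lo, uz, hi)
```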

Let $m = \sum k p_k$ be the expected gain in a single trial (or expected length of a single
step). It is easily seen from (8.6) that $s_1 > 1$ or $s_1 < 1$ according to whether m < 0
or m > 0. Letting $a \to \infty$, we conclude from our theorem that in a game against
an infinitely rich adversary the probability of an ultimate ruin is one if and only if
$m \le 0$.

The duration of game can be discussed by similar methods (cf. problem 18). 





9. Problems for Solution 

1. We modify the ruin problem of section 2 so that the gambler has probability
$\alpha$ to win a dollar, probability $\beta$ to lose a dollar, while with probability $\gamma$ the trial
ends in a tie. Show that the probability of ruin is still given by (2.4) and (2.5) with
$p = \alpha/(1 - \gamma)$, $q = \beta/(1 - \gamma)$. The expected duration of the game is $D_z/(1 - \gamma)$
with $D_z$ given by (3.4) and (3.5).

2. In the one-dimensional random walk with absorbing barriers at x = 0 and
x = a, let z be the initial position, and let $w_{z,n}(x)$ be the probability that the nth
step takes the particle to x. Show that $w_{z,n}(x)$ satisfies the difference equation
$w_{z,n+1}(x) = p\,w_{z+1,n}(x) + q\,w_{z-1,n}(x)$ with the boundary conditions (1) $w_{0,n}(x)
= w_{a,n}(x) = 0$ for $n \ge 1$; (2) $w_{z,0}(x) = 0$ if $z \ne x$ and $w_{x,0}(x) = 1$.

3. In problem 2 let there be reflecting barriers at x = 1/2 and x = a - 1/2 (cf.
section 1). Show that the statements of the preceding problem hold with the
boundary conditions (1) replaced by $w_{0,n}(x) = w_{1,n}(x)$ and $w_{a,n}(x) = w_{a-1,n}(x)$.

In the following problems $v_{x,n}$ is always the probability that in an unrestricted
random walk starting at the origin the nth step takes the particle to the position x. This
probability is given by (6.1); the symbols $u_{z,n}$, $\lambda_1(s)$, $\lambda_2(s)$, $w_{z,n}$ will be used as defined
in sections 4 and 5.

4. Suppose there is a single absorbing barrier at the origin. Let $u_{z,n}(x)$ be the
probability that a particle starting at x > 0 is after n steps at z > 0. If the random
walk is symmetric (p = q = 1/2), show that $u_{z,n}(x) = v_{z-x,n} - v_{z+x,n}$.

Hint: Show that a difference equation similar to (4.1) and the appropriate
boundary condition are satisfied.

5. Continuation. 13 If there are absorbing barriers at x = 0 and x = a, show
that

$u_{z,n}(x) = \sum_k \{v_{z-x-2ka,n} - v_{z+x-2ka,n}\},$

the summation extending over all k, positive and negative (only finitely many terms
are different from zero).

6. Alternative formula for the probability of ruin (5.7). Expanding (4.11) into a
geometric series, prove that

$u_{z,n} = \sum_{k=0}^{\infty} \left(\frac{p}{q}\right)^{ka} w_{z+2ka,n} - \sum_{k=1}^{\infty} \left(\frac{p}{q}\right)^{ka-z} w_{2ka-z,n}$

with $w_{z,n}$ defined in (5.9).

7. If the passage to the limit of section 6 is applied to the expression for $u_{z,n}$
given in the preceding problem, show that the probability of absorption during a
short time interval of length $\Delta t$ is asymptotically 14

$\tfrac{1}{2}\,\Delta t\,(\pi D t^3)^{-1/2}\, e^{-c(ct+z)/D} \sum_{k=-\infty}^{\infty} (z + 2ka)\, e^{-(z+2ka)^2/(4Dt)}.$

Hint: Apply the normal approximation to the binomial distribution.

13 This solution is obtained by the method of images used in potential theory and
due to Lord Kelvin. The term $v_{z-x,n}$ is the desired probability in the absence of
barriers (free random walk); then $v_{z+x,n}$ represents the corresponding probability
for an “image” particle which starts at the point -x (the point x mirrored at the
left barrier); $v_{z+x-2a,n}$ corresponds to another “image” starting at the point 2a - x
(which is x mirrored at the right end); there follow images of images, etc.

14 The agreement of the new formula with the limiting form (6.10) is a well-known
fact of the theory of theta functions.





8. 15 First passage times. In an unrestricted random walk starting at x = 0, let
$g_n$ be the probability that the particle reaches x = -1 for the first time at the nth
step. Without using any previous results prove directly that the generating function
G(s) of $\{g_n\}$ satisfies the equation $G(s) = qs + psG^2(s)$. Hence show that $G(s)
= \lambda_2(s)$ with $\lambda_2(s)$ defined in (4.8). Similarly, $1/\lambda_1(s) = p\lambda_2(s)/q$ is the generating
function of the first passage time through x = +1. Show also that this implies
that the first passage time through any position z has the generating function
$\lambda_2^{-z}(s)$ if z < 0 and $\lambda_1^{-z}(s)$ if z > 0 [cf. (4.13) which corresponds to the first
passage through -z].

9. Continuation: recurrence times. Let $f_n$ be the probability that the particle
returns to its initial position for the first time at the nth step. If F(s) is the corresponding
generating function, we must have $F(s) = ps\lambda_2(s) + qs/\lambda_1(s)$. [Note:
This is a new derivation of the equation $F(s) = 1 - (1 - 4pqs^2)^{1/2}$ found in
Chapter 12, section 3.]

10. Let $V_x(s) = \sum v_{x,n} s^n$ (cf. the note preceding problem 4). Using the results of
problem 8, prove that $V_x(s) = V_0(s)\lambda_2^{-x}(s)$ if $x \le 0$ and $V_x(s) = V_0(s)\lambda_1^{-x}(s)$ if
$x \ge 0$. [Note: These relations are almost obvious and should be proved without
calculations. It is easily verified that $V_0(s) = (1 - 4pqs^2)^{-1/2}$.]

11. Renewal method for the ruin problem. In the random walk with two absorbing
barriers of section 4 let $u_{z,n}$ and $u_{z,n}^*$ be, respectively, the probabilities of absorption
at the left and the right barriers. By a proper interpretation prove the truth
of the following two equations:

$$V_{-z}(s) = U_z(s)V_0(s) + U_z^*(s)V_{-a}(s),$$
$$V_{a-z}(s) = U_z(s)V_a(s) + U_z^*(s)V_0(s).$$

By solving this system for $U_z(s)$, derive (4.11).

12. Let $u_{z,n}(x)$ be the probability that the particle, starting from $z$, will at the
$n$th step be at $x$ without having previously touched the absorbing barriers. Using
the notations of problem 11, show that for the corresponding generating function
$U_z(s;x) = \sum u_{z,n}(x)s^n$ we have $U_z(s;x) = V_{x-z}(s) - U_z(s)V_x(s) - U_z^*(s)V_{x-a}(s)$.
(No calculations are required.)

13. Continuation. The generating function $U_z(s;x)$ of the preceding problem
can be obtained by putting $U_z(s;x) = V_{x-z}(s) - A\lambda_1^z(s) - B\lambda_2^z(s)$ and determining
the constants so that the boundary conditions $U_z(s;x) = 0$ for $z = 0$ and $z = a$
are satisfied. If there are reflecting barriers at $\tfrac{1}{2}$ and $a - \tfrac{1}{2}$ the boundary conditions
are $U_0(s;x) = U_1(s;x)$ and $U_a(s;x) = U_{a-1}(s;x)$.

14. A symmetric unrestricted random walk starts at the origin. The probability
that the $r$th return to the origin occurs at the $n$th step equals the probability that
the first passage through $x = r$ occurs at the $(n + r)$th step. (Hint: Compare the
generating functions.)

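15. Prove the formula

$$v_{x,n} = (2\pi)^{-1}\, 2^n p^{(n+x)/2} q^{(n-x)/2} \int_{-\pi}^{\pi} \cos^n t \cdot \cos tx \; dt$$

by showing that the appropriate difference equation is satisfied. Conclude that$^{16}$

$$V_x(s) = (2\pi)^{-1} \left(\frac{p}{q}\right)^{x/2} \int_{-\pi}^{\pi} \frac{\cos tx}{1 - 2(pq)^{1/2} \cdot s \cdot \cos t} \; dt.$$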


15 Problems 8-13 contain a new and independent derivation of the main results
concerning random walks in one dimension.

16 The formulas of problem 10 now follow easily by the calculus of residues.





16. In a three-dimensional symmetric random walk the particle has probability
one to pass infinitely often through any particular line $x = m$, $y = n$. (Hint: Cf.
problem 1.)

17. In a two-dimensional symmetric random walk starting at the origin the
probability that the $n$th step takes the particle to $(x, y)$ is

$$(2\pi)^{-2}\, 2^{-n} \int_{-\pi}^{\pi} \int_{-\pi}^{\pi} (\cos\alpha + \cos\beta)^n \cdot \cos x\alpha \cdot \cos y\beta \; d\alpha \, d\beta.$$

Verify this formula and find the analogue for three dimensions. (Hint: Check that
the expression satisfies the proper difference equation.)

18. In the generalized random-walk problem of section 8 put [in analogy with
(8.1)] $\rho_z = p_{a-z} + p_{a+1-z} + p_{a+2-z} + \cdots$, and let $d_{z,n}$ be the probability that the
game lasts for exactly $n$ steps. Show that for $n \ge 1$

$$d_{z,n+1} = \sum_{x=1}^{a-1} d_{x,n}\, p_{x-z}$$

with $d_{z,1} = r_z + \rho_z$. Hence prove that the generating function $d_z(s) = \sum d_{z,n}s^n$ is
the solution of the system of linear equations

$$s^{-1} d_z(s) - \sum_{x=1}^{a-1} d_x(s)\, p_{x-z} = r_z + \rho_z.$$

By differentiation it follows that the expected duration $e_z$ is the solution of

$$e_z - \sum_{x=1}^{a-1} e_x\, p_{x-z} = 1.$$



CHAPTER 15 


MARKOV CHAINS 


1. Definition 

Up to now we have been concerned mostly with independent trials,
which can be described as follows. A set of possible outcomes $E_1$,
$E_2$, ... (finite or infinite in number) is given, and with each there is
associated a probability $p_k$; the probabilities of sample sequences are
defined by the multiplicative property $\Pr\{(E_{j_0}, E_{j_1}, \ldots, E_{j_n})\} = p_{j_0} p_{j_1} \cdots p_{j_n}$.
In the theory of Markov$^1$ chains we consider the simplest
generalization, which consists in permitting the outcome of any
trial to depend on the outcome of the directly preceding trial (and only
on it). The outcome $E_k$ is then no longer associated with a fixed probability
$p_k$, but to every pair $(E_j, E_k)$ there corresponds a fixed conditional
probability $p_{jk}$: given that $E_j$ has occurred at some trial, the
probability of $E_k$ at the next trial is $p_{jk}$. In addition to the $p_{jk}$ we
must be given the probability $a_k$ of the outcome $E_k$ at the initial trial.
If the $p_{jk}$ are to have the meaning that we attributed to them, then
we must define the probabilities of sample sequences corresponding to
two, three, or four trials by

$$\Pr\{(E_j, E_k)\} = a_j p_{jk}, \qquad \Pr\{(E_j, E_k, E_r)\} = a_j p_{jk} p_{kr},$$
$$\Pr\{(E_j, E_k, E_r, E_s)\} = a_j p_{jk} p_{kr} p_{rs},$$

and generally

(1.1)   $$\Pr\{(E_{j_0}, E_{j_1}, \ldots, E_{j_n})\} = a_{j_0}\, p_{j_0 j_1}\, p_{j_1 j_2} \cdots p_{j_{n-2} j_{n-1}}\, p_{j_{n-1} j_n}.$$

Here the initial trial is numbered zero, so that trial number one is the
second trial. (This convention is convenient and has been introduced
tacitly in the preceding chapter.) Before scrutinizing the legitimacy
of the definition (1.1), we give two simple examples which will render the
notion more intuitive. The next section contains more interesting illustrations.

Examples. (a) Suppose we are given two unbalanced coins with faces
marked $E_1$ and $E_2$ such that with the first coin these faces have probabilities
$\alpha$ and $\beta = 1 - \alpha$, with the second $\alpha'$ and $\beta' = 1 - \alpha'$. One

1 A. A. Markov (1850-1922). 



of the two coins is selected at random and tossed; this is the initial (or
zero) trial. Each following trial consists in tossing the first or second
coin, according to whether the preceding trial resulted in $E_1$ or $E_2$. The
probabilities of $E_1$ and $E_2$ at the initial (or zeroth) trial are obviously
$a_1 = \tfrac{1}{2}(\alpha + \alpha')$ and $a_2 = \tfrac{1}{2}(\beta + \beta')$, respectively. Moreover, $p_{11} = \alpha$,
$p_{12} = \beta$, $p_{21} = \alpha'$, $p_{22} = \beta'$. The probability of $E_1$ at the second trial is
$\Pr\{(E_1, E_1)\} + \Pr\{(E_2, E_1)\} = \tfrac{1}{2}(\alpha + \alpha')\alpha + \tfrac{1}{2}(\beta + \beta')\alpha'$, etc.
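The arithmetic of example (a) can be checked with a short Python sketch. The function names and the numerical values of $\alpha$ and $\alpha'$ below are illustrative assumptions and are not part of the text; exact fractions are used so that no rounding enters.

```python
from fractions import Fraction

def two_coin_chain(alpha, alpha_prime):
    """Initial distribution and transition matrix of example (a).

    State index 0 stands for E1 and index 1 for E2 (an indexing
    convention chosen for this sketch).
    """
    beta, beta_prime = 1 - alpha, 1 - alpha_prime
    half = Fraction(1, 2)
    a = [half * (alpha + alpha_prime), half * (beta + beta_prime)]
    P = [[alpha, beta], [alpha_prime, beta_prime]]
    return a, P

def prob_E1_at_trial_one(a, P):
    # Pr{(E1, E1)} + Pr{(E2, E1)} = a1 * p11 + a2 * p21
    return a[0] * P[0][0] + a[1] * P[1][0]

# Hypothetical coins: alpha = 1/3 for the first, alpha' = 3/4 for the second.
a, P = two_coin_chain(Fraction(1, 3), Fraction(3, 4))
answer = prob_E1_at_trial_one(a, P)
print(a, answer)
```

With these assumed values the initial probabilities are 13/24 and 11/24, and the probability of $E_1$ at trial number one is 151/288.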

(b) Independent trials are illustrated by drawings with replacement
from an urn of fixed composition. Similarly, our new type of trials
may be realized by drawings from a sequence of urns. In the $j$th
urn balls of various colors are represented in the proportions
$p_{j1} : p_{j2} : p_{j3} : \cdots$. If a drawing has resulted in a ball of the $j$th color, the
next drawing is made from the urn numbered $j$.

It is clear that, if $a_k$ is the probability of $E_k$ at the initial (or zeroth)
trial, we must have $a_k \ge 0$ and $\sum a_k = 1$. Similarly, since whenever $E_j$
occurs it must be followed by some $E_k$, we must have for all $j$ and $k$

(1.2)   $$p_{j1} + p_{j2} + p_{j3} + \cdots = 1, \qquad p_{jk} \ge 0.$$

We want to show that for any numbers $a_k$ and $p_{jk}$ satisfying these
conditions, the assignment (1.1) is a permissible definition of probabilities
in the sample space corresponding to $n + 1$ trials. Since the
numbers defined in (1.1) are obviously non-negative, we need only
prove that they add to unity. Now first fix $j_0, j_1, \ldots, j_{n-1}$ and add
the numbers (1.1) for all possible $j_n$. Using (1.2) with $j = j_{n-1}$, we
see immediately that the sum is $a_{j_0} p_{j_0 j_1} \cdots p_{j_{n-2} j_{n-1}}$. Thus the sum
over all numbers (1.1) does not depend on $n$, and since $\sum a_{j_0} = 1$, the
sum equals unity for all $n$.
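The summation argument can be verified by brute force on a small chain. The following Python sketch is only an illustration: the three-state initial distribution and matrix are arbitrary assumptions, and the function names are not from the text.

```python
from fractions import Fraction
from itertools import product

def sequence_probability(a, P, path):
    """Probability (1.1) of the sample sequence whose state indices are `path`."""
    prob = a[path[0]]
    for j, k in zip(path, path[1:]):
        prob *= P[j][k]
    return prob

def total_probability(a, P, n):
    """Sum of the numbers (1.1) over all sample sequences of n + 1 trials."""
    states = range(len(a))
    return sum(sequence_probability(a, P, path)
               for path in product(states, repeat=n + 1))

# Arbitrary illustrative chain satisfying (1.2), states indexed 0, 1, 2.
a = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
P = [
    [Fraction(1, 2), Fraction(1, 2), Fraction(0)],
    [Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)],
    [Fraction(0), Fraction(1, 4), Fraction(3, 4)],
]

totals = [total_probability(a, P, n) for n in range(5)]
print(totals)
```

Each entry of `totals` equals one, in agreement with the argument above; summing only over the later indices with the first two fixed reproduces $a_j p_{jk}$.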

The definition (1.1) depends formally on the number of trials, but
our argument proves the mutual consistency of the definitions (1.1)
for all $n$. For example, to obtain the probability of the event “the
first two trials result in $(E_j, E_k)$,” we have to fix $j_0 = j$ and $j_1 = k$, and
add the probabilities (1.1) for all possible $j_2, j_3, \ldots, j_n$. We have just
shown that the sum is $a_j p_{jk}$, and thus independent of $n$. This means
that it is usually not necessary explicitly to refer to the number of
trials: the event $(E_{j_0}, \ldots, E_{j_r})$ has the same probability in all sample
spaces of more than $r$ trials. In connection with independent trials
it has been pointed out repeatedly that, from a mathematical point of
view, it is most satisfactory to introduce only the unique sample space
of unending sequences of trials and to consider the result of finitely





many trials as the beginning of an infinite sequence. This statement
holds true also for Markov chains. Unfortunately, sample spaces of
infinitely many trials lead beyond the theory of discrete probabilities
to which we are restricted in the present volume.

To summarize, our starting point is the following

Definition. A sequence of trials with possible outcomes $E_1, E_2, \ldots$ will
be called a Markov chain$^2$ if the probabilities of sample sequences are
defined by (1.1) in terms of an initial probability distribution $\{a_k\}$ for
the states $E_k$ at time 0 and fixed conditional probabilities $p_{jk}$ of $E_k$, given
that $E_j$ has occurred at the preceding trial.

We shall now modify our terminology so as to conform to the usage
in physical applications. Instead of saying “the $n$th trial results in
$E_k$” we shall say that at time $n$ the system is in state $E_k$. The conditional
probability $p_{jk}$ will be called the probability of the transition $E_j \to E_k$
(from state $E_j$ to state $E_k$).

The transition probabilities $p_{jk}$ will be arranged in a matrix of transition
probabilities

(1.3)   $$P = \begin{bmatrix} p_{11} & p_{12} & p_{13} & \cdots \\ p_{21} & p_{22} & p_{23} & \cdots \\ p_{31} & p_{32} & p_{33} & \cdots \\ \vdots & \vdots & \vdots & \end{bmatrix}$$

where the first subscript stands for row, the second for column. Clearly
$P$ is a square matrix with non-negative elements and unit row sums.
Such a matrix (finite or infinite) is called a stochastic matrix. Any
stochastic matrix can serve as a matrix of transition probabilities; together
with our initial distribution $\{a_k\}$ it completely defines a Markov chain.

In some special cases it is convenient to number the states starting 
with 0 rather than with 1. A zero row and zero column are then to 
be added to P. 
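For a finite number of states the two defining properties of a stochastic matrix are easily tested, and a chain defined by an initial distribution and a stochastic matrix can be simulated trial by trial. The sketch below is an illustration only; the function names, the tolerance, and the two-state matrix are assumptions made for the example, and states are indexed from zero.

```python
import random

def is_stochastic(P, tol=1e-12):
    """True if P is square with non-negative entries and unit row sums."""
    n = len(P)
    return all(
        len(row) == n
        and all(x >= 0 for x in row)
        and abs(sum(row) - 1) <= tol
        for row in P
    )

def simulate(a, P, n_steps, rng):
    """Sample the states at times 0, 1, ..., n_steps.

    The state at time 0 is drawn from the initial distribution a; each
    later state is drawn from the row of P belonging to the current state.
    """
    states = range(len(a))
    state = rng.choices(states, weights=a)[0]
    path = [state]
    for _ in range(n_steps):
        state = rng.choices(states, weights=P[state])[0]
        path.append(state)
    return path

# Hypothetical two-state chain.
P = [[0.5, 0.5], [0.2, 0.8]]
a = [0.3, 0.7]
rng = random.Random(1)
path = simulate(a, P, 10, rng)
print(is_stochastic(P), path)
```

The random number generator is passed in explicitly so that a run can be reproduced by fixing its seed.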


2 This is not the standard terminology. We are here considering only a special
class of Markov chains, and, strictly speaking, here and in the following sections
the term Markov chain should always be qualified by adding the clause “with
constant transition probabilities.” Actually, the general type of Markov chain
is rarely studied. It will be defined in section 10, where the Markov property will
be discussed in relation to general stochastic processes. There the reader will also
find examples of dependent trials that do not form Markov chains.








2. Illustrative Examples 

This section contains a list of various special examples which will
familiarize the reader with the notion of a Markov chain. To save
space we shall repeatedly refer to these examples to illustrate various
definitions and theorems. The reader is advised not to attempt to
keep these examples continuously in mind but to consider each reference
to them independently. For an application of Markov chains
to card shuffling cf. section 9.

I. Independent Trials. Let $p_{jk} = a_k$ be independent of $j$, so that all
rows of the transition matrix $P$ are identical. Then the trials are
independent. For example, to Bernoulli trials there corresponds a
2 by 2 matrix with rows $(p, q)$.

II. Success Runs. Consider a sequence of Bernoulli trials, and let
us agree to say that at time $t$ the system is in state $E_k$ ($k = 1, 2, \ldots$) if
the $t$th trial results in a success which is the $k$th success in an uninterrupted
sequence [in other words, if the trials numbered $t$, $t - 1$,
$t - 2$, ..., $t - k + 1$ resulted in success but the $(t - k)$th trial in
failure]. Further, we say that at time $n$ the system is in state $E_0$ if the
$n$th trial resulted in failure; at time 0 the system starts from the state
$E_0$. We have here a Markov chain with states $E_0, E_1, E_2, \ldots$; the
initial distribution is $(1, 0, 0, 0, \ldots)$; the transition probabilities are
defined by $p_{j,j+1} = p$, $p_{j,0} = q$, and $p_{jk} = 0$ whenever $k$ is not either
$j + 1$ or 0. Thus

$$P = \begin{bmatrix} q & p & 0 & 0 & 0 & \cdots \\ q & 0 & p & 0 & 0 & \cdots \\ q & 0 & 0 & p & 0 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \vdots & \end{bmatrix}.$$

III. Random Walk with Absorbing Barriers. The random-walk
problems of the preceding chapter are examples of Markov chains
with positions playing the roles of states. If there are absorbing barriers
at $x = 0$ and $x = a$, the possible states are $E_0, E_1, \ldots, E_a$ with
$E_k$ standing for $x = k$. For $1 \le j \le a - 1$ the system can pass from
$E_j$ either to $E_{j-1}$ or to $E_{j+1}$, but no further change is possible once the
system reaches either $E_0$ or $E_a$. Hence $p_{00} = p_{aa} = 1$, and $p_{j,j+1} = p$,
$p_{j,j-1} = q$ provided $1 \le j \le a - 1$; all other transition probabilities
vanish. The matrix $P$ is given by

$$P = \begin{bmatrix} 1 & 0 & 0 & 0 & \cdots & 0 & 0 & 0 \\ q & 0 & p & 0 & \cdots & 0 & 0 & 0 \\ 0 & q & 0 & p & \cdots & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & q & 0 & p \\ 0 & 0 & 0 & 0 & \cdots & 0 & 0 & 1 \end{bmatrix}.$$

The initial probabilities are, in principle, arbitrary. In the preceding
chapter we have assumed that the particle starts from the state $E_z$.
This corresponds to $a_z = 1$, $a_k = 0$ for $z \ne k$.
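The matrix of example III is easily generated for any given $a$ and $p$. The Python sketch below is one possible construction; the function names and the sample values $a = 4$, $p = 1/3$, $z = 2$ are assumptions for illustration.

```python
from fractions import Fraction

def absorbing_walk_matrix(a, p):
    """Transition matrix for the states E_0, ..., E_a of example III."""
    q = 1 - p
    P = [[0] * (a + 1) for _ in range(a + 1)]
    P[0][0] = 1          # E_0 is absorbing
    P[a][a] = 1          # E_a is absorbing
    for j in range(1, a):
        P[j][j + 1] = p  # step to the right
        P[j][j - 1] = q  # step to the left
    return P

def start_at(a, z):
    """Initial distribution a_z = 1, a_k = 0 for k != z."""
    return [1 if k == z else 0 for k in range(a + 1)]

P = absorbing_walk_matrix(4, Fraction(1, 3))
initial = start_at(4, 2)
for row in P:
    print(row)
```

Row $j$ of the result holds the probabilities $p_{j0}, \ldots, p_{ja}$, so the list index coincides with the subscript of the state.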


IV. Reflecting Barriers. We modify the preceding example so that
the possible states are $E_1, E_2, \ldots, E_a$ (with $E_0$ omitted). From the
interior states $E_2, E_3, \ldots, E_{a-1}$, the system can pass either to the
right or to the left neighbor, exactly as in the ordinary random walk.
However, from $E_1$, the system has probability $p$ to pass to $E_2$ and
probability $q$ to stay in $E_1$. Similarly, from $E_a$ only the transitions
$E_a \to E_a$ and $E_a \to E_{a-1}$ are possible, and the corresponding probabilities
are $p$ and $q$. In the terminology of random walks this means
reflecting barriers at $x = \tfrac{1}{2}$ and $x = a + \tfrac{1}{2}$ (cf. Chapter 14, section 1).
Alternatively, the state of the system may stand for a gambler's
fortune if the familiar gambling for unit stakes is amended by an
agreement that whenever a player loses his last dollar he is given one
dollar by his adversary. A continuation of the game is then always
possible; the combined capital of the two gamblers is $a + 1$ and
remains constant. The matrix of transition probabilities is now

$$P = \begin{bmatrix} q & p & 0 & 0 & \cdots & 0 & 0 & 0 \\ q & 0 & p & 0 & \cdots & 0 & 0 & 0 \\ 0 & q & 0 & p & \cdots & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & q & 0 & p \\ 0 & 0 & 0 & 0 & \cdots & 0 & q & p \end{bmatrix}.$$

[Continued in example (6.b), problem 10, and Chapter 16, section 3.]

V. Cyclical Random Walks. Again let the possible states be $E_1$,
$E_2, \ldots, E_a$ but order them cyclically so that $E_a$ has the neighbors
$E_{a-1}$ and $E_1$. If, as before, the system always passes either to the
right or to the left neighbor, the rows of the matrix $P$ are as in the preceding
example, except that the first row is $(0, p, 0, 0, \ldots, 0, q)$ and
the last $(p, 0, 0, 0, \ldots, 0, q, 0)$.

More generally, we may permit transitions between any two states.
Let $q_0, q_1, \ldots, q_{a-1}$ be, respectively, the probability of staying fixed
or moving $1, 2, \ldots, a - 1$ units to the right (where $k$ units to the right
is the same as $a - k$ units to the left). Then $P$ is the cyclical matrix

$$P = \begin{bmatrix} q_0 & q_1 & q_2 & \cdots & q_{a-2} & q_{a-1} \\ q_{a-1} & q_0 & q_1 & \cdots & q_{a-3} & q_{a-2} \\ q_{a-2} & q_{a-1} & q_0 & \cdots & q_{a-4} & q_{a-3} \\ \vdots & \vdots & \vdots & & \vdots & \vdots \\ q_1 & q_2 & q_3 & \cdots & q_{a-1} & q_0 \end{bmatrix}.$$

If $q_1 = p$, $q_{a-1} = q$, and $q_k = 0$ for $1 < k < a - 1$, then this random
walk reduces to the simple case discussed at the beginning of this
example. [The discussion is continued in Chapter 16, example (2.c).]
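The cyclical matrix is determined by its first row: the element in row $j$ and column $k$ is the probability of moving $k - j$ units to the right, reduced modulo $a$. A minimal Python sketch, with rows and columns indexed from zero and with an assumed function name and sample values:

```python
from fractions import Fraction

def cyclical_matrix(q):
    """Cyclical transition matrix built from q = [q_0, q_1, ..., q_{a-1}].

    Entry (j, k) is q[(k - j) mod a], the probability of moving
    (k - j) mod a units to the right.
    """
    a = len(q)
    return [[q[(k - j) % a] for k in range(a)] for j in range(a)]

# Simple cyclical walk on a = 5 states with q_0 = 0, q_1 = p, q_{a-1} = q
# (the values of p and q are illustrative).
p, q_left = Fraction(1, 3), Fraction(2, 3)
C = cyclical_matrix([0, p, 0, 0, q_left])
for row in C:
    print(row)
```

Here $q_0$ is taken to be zero as well, which is what the simple walk with steps only to the neighbors requires. The first and last rows then agree with those stated at the beginning of the example.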

VI. Unrestricted Random Walks. An unrestricted one-dimensional
random walk is a Markov chain where it is most natural to order the
states in a doubly infinite sequence $(\ldots, E_{-2}, E_{-1}, E_0, E_1, E_2, \ldots)$. In
order to write the matrix of transition probabilities in the familiar
form, we must rearrange the states. For example, we may write them
in the order $(E_0, E_1, E_{-1}, E_2, E_{-2}, \ldots)$; the first row of $P$ is then
$(0, p, q, 0, 0, \ldots)$, the second $(q, 0, 0, p, 0, 0, \ldots)$, etc. Unfortunately,
the natural symmetry is lost, and the formulas become unpleasant.
The situation grows even worse in two dimensions. In such cases the
methods of this chapter are not convenient for deriving explicit formulas,
but the general theorems apply and contain pertinent information.

VII. The Ehrenfest Model of Diffusion. Once more we consider a
chain with the $a + 1$ states $E_0, E_1, \ldots, E_a$ and transitions possible
only to the right and the left neighbor; however, this time we put
$p_{j,j+1} = 1 - j/a$ and $p_{j,j-1} = j/a$, so that

$$P = \begin{bmatrix} 0 & 1 & 0 & 0 & \cdots & 0 & 0 \\ a^{-1} & 0 & 1 - a^{-1} & 0 & \cdots & 0 & 0 \\ 0 & 2a^{-1} & 0 & 1 - 2a^{-1} & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & 0 & a^{-1} \\ 0 & 0 & 0 & 0 & \cdots & 1 & 0 \end{bmatrix}.$$

This chain has two interesting physical interpretations. For a
discussion of various recurrence problems in statistical mechanics
P. and T. Ehrenfest$^3$ described a conceptual experiment where $a$
molecules are distributed in two containers $A$ and $B$. At time $n$ a molecule
is chosen at random and removed from its container to the other.
Let the state of the system be determined by the number of molecules
in $A$. Suppose that at a certain moment there are exactly $k$ molecules
in the container $A$. At the next trial the system passes into $E_{k-1}$
or $E_{k+1}$ according to whether a molecule in $A$ or $B$ is chosen; the
corresponding probabilities are $k/a$ and $(a - k)/a$, and therefore our
chain describes Ehrenfest's experiment. However, our chain can also
be interpreted as diffusion with a central force,$^4$ that is, a random walk
in which the probability of a step to the right varies with the position.
From $x = j$ the particle is more likely to move to the right or to the left
according as $j < a/2$ or $j > a/2$; this means that the particle has a
tendency to move towards $x = a/2$, which corresponds to an attractive
elastic force increasing in direct proportion to the distance.
[Discussion continued in example (6.c) and problem 6.$^5$]
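The tendency towards $x = a/2$ can be made quantitative: from state $E_j$ the expected displacement in one step is $(1 - j/a) - j/a = 1 - 2j/a$, which is proportional to the distance of $j$ from $a/2$. The following Python sketch builds the Ehrenfest matrix and evaluates this drift; the function names and the choice $a = 6$ are assumptions for illustration.

```python
from fractions import Fraction

def ehrenfest_matrix(a):
    """Transition matrix of the Ehrenfest chain on the states E_0, ..., E_a."""
    P = [[Fraction(0)] * (a + 1) for _ in range(a + 1)]
    for j in range(a + 1):
        if j < a:
            P[j][j + 1] = 1 - Fraction(j, a)  # a molecule of B is chosen
        if j > 0:
            P[j][j - 1] = Fraction(j, a)      # a molecule of A is chosen
    return P

def expected_step(P, j):
    """Expected displacement in one step when the system is in E_j."""
    return sum((k - j) * P[j][k] for k in range(len(P)))

a = 6
P = ehrenfest_matrix(a)
drift = [expected_step(P, j) for j in range(a + 1)]
print(drift)
```

The drift is positive for $j < a/2$, zero at $j = a/2$, and negative for $j > a/2$.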

VIII. Occupancy Problems. In Chapter 3 we considered random
placements of balls into $a$ cells. Let the number of occupied cells
determine the state of the system. If $j$ cells are occupied, the probability
that the next ball is placed into an empty cell is $(a - j)/a$.
Hence the experiment is described by a chain with transition probabilities
$p_{jj} = j/a$, $p_{j,j+1} = (a - j)/a$, and $p_{jk} = 0$ for all other combinations
of $j$ and $k$. The initial distribution (all cells empty) is given
by $a_0 = 1$, $a_k = 0$ for $1 \le k \le a$. [Cf. Chapter 16, example (2.e).]

IX. Sequential Sampling. In Chapter 14 (end of section 1 and
section 8) we considered the following generalized ruin problem connected
with sequential sampling. Given a sequence of mutually
independent random variables $X_\nu$ which assume only integral values
(positive and negative) and have a common distribution $\{p_k\}$ ($k = 0$,
$\pm 1, \pm 2, \ldots$). Put $S_n = X_1 + \cdots + X_n$. There exists a smallest
subscript $n$ for which either $S_n \ge b$ or $S_n \le -z$; here $b$ and $z$ are preassigned
positive numbers and $n$ is, of course, a random variable. A

3 P. and T. Ehrenfest, Über zwei bekannte Einwände gegen das Boltzmannsche
H-Theorem, Physikalische Zeitschrift, vol. 8 (1907), pp. 311-314.

4 Ming Chen Wang and G. E. Uhlenbeck, On the Theory of the Brownian Motion
II, Reviews of Modern Physics, vol. 17 (1945), pp. 323-342.

5 For a more complete discussion (by methods essentially equivalent to those of
Chapter 16) cf. M. Kac, Random Walk and the Theory of Brownian Motion,
American Mathematical Monthly, vol. 54 (1947), pp. 369-391. See also B. Friedman,
A Simple Urn Model, Communications on Pure and Applied Mathematics,
vol. 2 (1949), pp. 69-70.





general problem of sequential sampling according to Wald consists
in finding the distribution of $n$ and the probabilities of the two contingencies
$S_n \le -z$ and $S_n \ge b$.

We interpret this problem as follows. Put $a = b + z$ and consider
a Markov chain with possible states $x = 0, 1, 2, \ldots, a$. The initial
state of the system is $z$. If $S_1$ has a value $-z < S_1 < b$, then we say
that the first step takes the system into the state $x = S_1 + z$; if
$S_1 \le -z$, the first step takes the system into $x = 0$, while if $S_1 \ge b$,
then a transition into $x = a$ occurs. If one of the two limiting states
$x = 0$ and $x = a$ is reached, the system remains in it for all future time
(which is a way of expressing that the process stops). Otherwise the
process continues as described: at time $n$ the system is in state
$x = S_n + z$ provided that all partial sums $S_1, S_2, \ldots, S_{n-1}$ lie in the
interval $-z < S_k < b$. Otherwise the system is in $x = 0$ or $x = a$
according to whether the first sum $S_k$ which falls outside this interval

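is negative or positive. The matrix of transition probabilities is then

$$P = \begin{bmatrix} 1 & 0 & 0 & 0 & \cdots & 0 & 0 \\ r_1 & p_0 & p_1 & p_2 & \cdots & p_{a-1} & \rho_1 \\ r_2 & p_{-1} & p_0 & p_1 & \cdots & p_{a-2} & \rho_2 \\ r_3 & p_{-2} & p_{-1} & p_0 & \cdots & p_{a-3} & \rho_3 \\ \vdots & \vdots & \vdots & \vdots & & \vdots & \vdots \\ r_a & p_{-a+1} & p_{-a+2} & p_{-a+3} & \cdots & p_0 & \rho_a \\ 0 & 0 & 0 & 0 & \cdots & 0 & 1 \end{bmatrix}$$

where

$$r_k = p_{-k} + p_{-k-1} + p_{-k-2} + p_{-k-3} + \cdots$$

and

$$\rho_k = p_{a-k+1} + p_{a-k+2} + \cdots.$$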

As an example consider Bartky's original sampling scheme (mentioned
in section 1 of Chapter 14) which was the first sequential scheme
to be proposed. The integer $a$, the so-called rejection level, is fixed.
The lot of items which is subjected to sampling inspection must be
large, and, for theoretical purposes, we shall assume it infinitely large.
A preliminary sample is drawn and the lot is accepted if the sample
contains no defective item and rejected if it contains at least $a$ defective
items. In either case the process terminates. The Markov chain
starts only if the number $j$ of defective items lies between the limits 0







and $a$, and in this case $j$ is the initial state (so that the initial distribution
depends on the manner in which the preliminary sample is taken).
The process consists in drawing successive independent samples of
fixed size $N$, counting each time the number of defectives. Allowance
is made for one defective per lot, that is, whenever the new sample contains
exactly one defective, the state remains unchanged. If no defective
is found, the system moves to the next lower state, $j \to j - 1$. If
$r + 1$ defectives are found, the system moves from $j$ to $j + r$, except
that it moves to $a$ if $j + r > a$. In practice passing to 0 means acceptance,
and passing to $a$ rejection; sampling is continued until one of the
two alternatives occurs.

In our previous notation $X_n$ is the number of defectives in the $n$th
sample minus one. Assuming that the number of defectives has a
Bernoulli distribution, we have for $k \ge 0$

(2.1)   $$p_k = \binom{N}{k+1} p^{k+1} q^{N-k-1}$$

and $p_{-1} = q^N$, $p_{-2} = p_{-3} = \cdots = 0$.


X. An Example from Genetics.$^6$ Consider a population which is kept
constant in size by the selection of $N$ individuals in each successive
generation. We classify individuals with respect to a particular gene
pair $(A, a)$. There are $2N$ genes in the population, and if in the $n$th
generation $A$ occurs $j$ times, then $a$ occurs $2N - j$ times. In this case
we say that the population is at time $n$ in state $j$ ($0 \le j \le 2N$). Assuming
random mating, the composition of the following generation is
determined by $2N$ Bernoulli trials in which the $A$-gene has probability
$j/2N$. We have therefore a Markov chain with

(2.2)   $$p_{jk} = \binom{2N}{k} \left(\frac{j}{2N}\right)^k \left(1 - \frac{j}{2N}\right)^{2N-k}.$$

[Cf. example (8.c).]
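Formula (2.2) can be tabulated directly. The Python sketch below does this with exact fractions for a small population; the function name and the choice $N = 2$ are illustrative assumptions. Since row $j$ is a binomial distribution with $2N$ trials and probability $j/2N$, its mean is $j$, which gives a convenient check.

```python
from fractions import Fraction
from math import comb

def genetics_matrix(N):
    """Transition probabilities (2.2) for the states j = 0, 1, ..., 2N."""
    size = 2 * N
    P = []
    for j in range(size + 1):
        x = Fraction(j, size)  # probability of the A-gene in one trial
        P.append([comb(size, k) * x**k * (1 - x)**(size - k)
                  for k in range(size + 1)])
    return P

P = genetics_matrix(2)
means = [sum(k * P[j][k] for k in range(len(P))) for j in range(len(P))]
for row in P:
    print(row)
```

In this table the rows for $j = 0$ and $j = 2N$ place all probability on the state itself.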

6 This problem was discussed at length by R. A. Fisher and S. Wright. The
formulation in terms of Markov chains is due to G. Malécot, Sur un problème de
probabilités en chaîne que pose la génétique, Comptes rendus de l'Académie des
Sciences, vol. 219 (1944), pp. 379-381.

XI. A Breeding Problem. In the so-called brother-sister mating two
individuals are mated, and among their direct descendants two individuals
of opposite sex are selected at random. These are again mated,
and the process continues indefinitely. If there are the three genotypes
$AA$, $Aa$, $aa$ for each parent, then we have to distinguish six combinations
of parents. We order these six possible states of our system as
follows: $E_1 = AA \times AA$, $E_2 = AA \times Aa$, $E_3 = Aa \times Aa$,
$E_4 = Aa \times aa$, $E_5 = aa \times aa$, $E_6 = AA \times aa$. Using the rules of
Chapter 5, it is easily seen that the matrix of transition probabilities
is in this case

$$P = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 1/4 & 1/2 & 1/4 & 0 & 0 & 0 \\ 1/16 & 1/4 & 1/4 & 1/4 & 1/16 & 1/8 \\ 0 & 0 & 1/4 & 1/2 & 1/4 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \end{bmatrix}.$$

[The discussion is continued in example (4.c) and problem 3; a complete
treatment is given in Chapter 16, example (4.b).]
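As a numerical illustration of this matrix, the following Python sketch checks the row sums and multiplies the matrix by itself repeatedly to see where the probability accumulates when the process starts from $E_3 = Aa \times Aa$. The helper names and the choice of 40 steps are assumptions made for the example; states $E_1, \ldots, E_6$ are stored at indices 0 to 5.

```python
from fractions import Fraction as F

P = [
    [F(1), 0, 0, 0, 0, 0],
    [F(1, 4), F(1, 2), F(1, 4), 0, 0, 0],
    [F(1, 16), F(1, 4), F(1, 4), F(1, 4), F(1, 16), F(1, 8)],
    [0, 0, F(1, 4), F(1, 2), F(1, 4), 0],
    [0, 0, 0, 0, F(1), 0],
    [0, 0, F(1), 0, 0, 0],
]

def mat_mul(A, B):
    """Row into column multiplication of two square matrices."""
    n = len(A)
    return [[sum(A[j][v] * B[v][k] for v in range(n)) for k in range(n)]
            for j in range(n)]

def mat_pow(M, n):
    """M multiplied by itself, n factors in all (n >= 1)."""
    result = M
    for _ in range(n - 1):
        result = mat_mul(M, result)
    return result

P40 = mat_pow(P, 40)
# Probability of being in E1 or E5 after 40 generations, starting from E3.
absorbed = P40[2][0] + P40[2][4]
print(float(P40[2][0]), float(P40[2][4]), float(absorbed))
```

After 40 steps nearly all of the probability sits in $E_1$ and $E_5$, and by the symmetry between the two genes it is shared equally between them.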

XII. Decomposable Chains. The following is a rather artificial
example designed to illustrate certain points of the theory.

Given a coin with faces $E_1$ and $E_2$, and a die with faces $E_3, \ldots, E_8$.
We select one of the two pieces at random and perform independent
trials with it. In other words, the entire process consists either in
tossing of a coin or in throwing a die, each alternative having probability
1/2. The matrix of transition probabilities can be exhibited
schematically in the form of a partitioned matrix,

$$P = \begin{bmatrix} A & 0 \\ 0 & B \end{bmatrix},$$

where $A$ stands for the 2 by 2 matrix with elements 1/2, and $B$ for the
6 by 6 matrix with elements 1/6; the zeros indicate that the remaining
24 elements vanish.

Obviously our chain is an artificial combination of the two chains
representing coin tossing and die throwing. The matrices of transition
probabilities corresponding to these two chains are $A$ and $B$. It is
more natural to study the two chains separately, and, at any rate, all
properties of the combined chain can be obtained from a study of the
two component chains. We have here a typical example of a decomposition
(for the definition cf. section 4), and a similar procedure can be used
for more complicated artificial combinations of several chains.

XIII. Periodic Chains. Let trials consist in throwing alternately
a coin and a die and number the states as in the preceding example.
We have now for the transition probabilities the partitioned matrix

$$P = \begin{bmatrix} 0 & U \\ V & 0 \end{bmatrix},$$

where $U$ is a 2 by 6, and $V$ a 6 by 2 matrix. (The first two rows are
0, 0, 1/6, 1/6, 1/6, 1/6, 1/6, 1/6. The last six are 1/2, 1/2, 0, 0, 0,
0, 0, 0.) If we consider the process only at times 2, 4, 6, ..., we have
a simple die-throwing experiment, whereas at times 1, 3, 5, ..., we
are concerned with coin tossing. This chain has period 2.

3. Higher Transition Probabilities 

A transition from $E_j$ to $E_k$ in exactly $n$ steps can occur via different
paths $E_j \to E_{j_1} \to E_{j_2} \to \cdots \to E_{j_{n-1}} \to E_k$. The conditional probability
that the system passes through this particular path if it is at $E_j$
at a certain time is $p_{jj_1} p_{j_1 j_2} \cdots p_{j_{n-1} k}$. The sum of the corresponding
expressions for all possible paths is the probability of finding the system
at time $r + n$ in state $E_k$, given that at time $r$ it was in state $E_j$. We shall
denote this probability by $p_{jk}^{(n)}$.

We have, in particular, $p_{jk}^{(1)} = p_{jk}$, and

(3.1)   $$p_{jk}^{(2)} = \sum_\nu p_{j\nu}\, p_{\nu k}.$$

By induction we find easily the recursion formula

(3.2)   $$p_{jk}^{(n+1)} = \sum_\nu p_{j\nu}\, p_{\nu k}^{(n)};$$

a further induction on $m$ shows that more generally

(3.3)   $$p_{jk}^{(m+n)} = \sum_\nu p_{j\nu}^{(m)}\, p_{\nu k}^{(n)}.$$

This equation reflects the simple fact that the first $m$ steps lead the
system from $E_j$ to some intermediate state $E_\nu$, and the last $n$ steps
from $E_\nu$ to $E_k$. It characterizes Markov chains. For more general
processes (cf. section 10) a similar equation holds, but the last factor
depends not only on $\nu$ and $k$ but also on $j$.

In the same way as the $p_{jk}$ form the matrix $P$, we arrange the $p_{jk}^{(n)}$ in a
matrix to be denoted by $P^n$. Equation (3.2) states that to obtain the
element $p_{jk}^{(n+1)}$ of $P^{n+1}$ we have to multiply the elements of the $j$th
row of $P$ by the corresponding elements of the $k$th column of $P^n$ and add
all products. This operation is called row into column multiplication
of the matrices $P$ and $P^n$, and is expressed symbolically by the equation
$P^{n+1} = PP^n$. This suggests calling $P^n$ the $n$th power of $P$; equation
(3.3) expresses the associative law $P^{m+n} = P^m P^n$.
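The recursion (3.2) and the relation (3.3) translate directly into a few lines of Python. The sketch below is an illustration under stated assumptions: the function names and the three-state matrix are chosen for the example, and states are indexed from zero.

```python
from fractions import Fraction

def mat_mul(A, B):
    """Row into column multiplication: entry (j, k) is sum_v A[j][v] * B[v][k]."""
    n = len(A)
    return [[sum(A[j][v] * B[v][k] for v in range(n)) for k in range(n)]
            for j in range(n)]

def higher_transition_matrix(P, n):
    """Matrix of the p_jk^(n) for n >= 1, built by (3.2): P^(n+1) = P P^n."""
    result = P
    for _ in range(n - 1):
        result = mat_mul(P, result)
    return result

# Arbitrary illustrative stochastic matrix.
P = [
    [Fraction(1, 2), Fraction(1, 2), Fraction(0)],
    [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)],
    [Fraction(0), Fraction(1, 3), Fraction(2, 3)],
]

P2 = higher_transition_matrix(P, 2)
P3 = higher_transition_matrix(P, 3)
P5 = higher_transition_matrix(P, 5)
print(P5 == mat_mul(P2, P3))  # relation (3.3) with m = 2, n = 3
```

Exact fractions make the comparison in the last line an identity rather than an approximate equality.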





Examples. (a) In example I we have $P^n = P$ for all $n$.

(b) In example II the numbering of rows and columns starts with
zero. In the zero column of $P^n$ all elements equal $q$; in the first
column they equal $qp$; and, generally, for $k \le n - 1$ all elements
of column number $k$ equal $p^k q$. Moreover, we have $p_{0,n}^{(n)} = p_{1,n+1}^{(n)}
= p_{2,n+2}^{(n)} = \cdots = p^n$. These elements are on a line parallel to the
main diagonal. All other elements of $P^n$ vanish.

(c) In example XII all powers of $P$ are identical with $P$.

(d) In example XIII the square of $P$ equals the matrix of example
XII; it follows that also $P^2 = P^4 = \cdots$. On the other hand, $P = P^3
= P^5 = \cdots$.
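Examples (c) and (d) can be confirmed by direct multiplication. The Python sketch below builds the 8 by 8 matrices of examples XII and XIII from their descriptions; the variable names are assumptions, and the states $E_1, \ldots, E_8$ occupy indices 0 to 7.

```python
from fractions import Fraction

def mat_mul(A, B):
    """Row into column multiplication of two square matrices."""
    n = len(A)
    return [[sum(A[j][v] * B[v][k] for v in range(n)) for k in range(n)]
            for j in range(n)]

half, sixth = Fraction(1, 2), Fraction(1, 6)

# Example XII: coin block A (2 by 2) and die block B (6 by 6) on the diagonal.
P12 = ([[half, half] + [0] * 6 for _ in range(2)]
       + [[0, 0] + [sixth] * 6 for _ in range(6)])

# Example XIII: the blocks U (2 by 6) and V (6 by 2) off the diagonal.
P13 = ([[0, 0] + [sixth] * 6 for _ in range(2)]
       + [[half, half] + [0] * 6 for _ in range(6)])

square_13 = mat_mul(P13, P13)
cube_13 = mat_mul(square_13, P13)
print(mat_mul(P12, P12) == P12, square_13 == P12, cube_13 == P13)
```

All three comparisons hold, so the even powers of the periodic matrix coincide with the decomposable matrix and the odd powers with the periodic matrix itself.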

Absolute Probabilities. If the initial probability of the state $E_j$ is
$a_j$, then the (unconditional) probability of finding the system at time $n$
in state $E_k$ is obviously

(3.4)   $$a_k^{(n)} = \sum_j a_j\, p_{jk}^{(n)}.$$

The most important properties of Markov chains depend on the
asymptotic behavior of $p_{jk}^{(n)}$ as $n \to \infty$. Intuitively one would expect
that the influence of the initial stage gradually wears off, so that for
large $n$ the probability of finding the system at time $n$ in state $E_k$
should be independent of the state at time 0. We mean by this that
$p_{jk}^{(n)}$ tends to a limit $u_k$ which is independent of $j$ (a property called
ergodicity). We shall show that our intuitive surmise is generally true,
but exceptions exist.
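The wearing off of the initial state is easily observed numerically. The Python sketch below computes the absolute probabilities (3.4) step by step, using the fact that the distribution at time $n + 1$ is obtained from that at time $n$ by one multiplication with $P$. The function name and the two-state matrix are assumptions chosen for illustration; for this particular matrix the limits are $u = (1/3, 2/3)$.

```python
def absolute_probabilities(a, P, n):
    """The probabilities a_k^(n) of (3.4), obtained by n steps of a <- a P."""
    size = len(P)
    dist = list(a)
    for _ in range(n):
        dist = [sum(dist[j] * P[j][k] for j in range(size))
                for k in range(size)]
    return dist

# Hypothetical two-state chain, started once from each state.
P = [[0.5, 0.5], [0.25, 0.75]]
from_first = absolute_probabilities([1.0, 0.0], P, 30)
from_second = absolute_probabilities([0.0, 1.0], P, 30)
print(from_first, from_second)
```

After 30 steps the two starting points give the same probabilities to within rounding error. A chain such as that of example XIII would not behave in this way, in accordance with the remark that exceptions exist.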

4. Irreducible Chains

We shall say that the state $E_k$ can be reached from $E_j$ if there exists
some $n$ such that $p_{jk}^{(n)} > 0$, that is, if there is a positive probability
of reaching $E_k$ from $E_j$ in $n$ steps. It is not necessary that $E_k$ can be
reached from $E_j$ in one step. For example, in an unrestricted random
walk one-step transitions are possible only to the neighboring states,
but every state can be reached from every other state. On the other
hand, in example XII, only $E_1$ and $E_2$ can be reached from $E_1$ or $E_2$,
while from $E_3$ only the states $E_3, \ldots, E_8$ can be reached. The following
definition is designed to cope with such situations.

Definition. A set $C$ of states is called closed if no one-step transition
is possible from any state of $C$ to any state outside $C$, that is, if $p_{jk} = 0$
whenever $E_j$ is in $C$ and $E_k$ outside. A chain is called irreducible if
there are no closed sets other than the set of all states.





If $E_j$ is in the closed set $C$ and $E_k$ outside, then $p_{jk} = 0$, and it follows
from (3.1) that also $p_{jk}^{(2)} = 0$. More generally we conclude from (3.2)
that $p_{jk}^{(n)} = 0$ for every $n$. This expresses the intuitively obvious
fact that no escape is possible from a closed set: no state outside a closed
set $C$ can be reached from any state in $C$. It follows that, if $E_j$ is in $C$,
then the sum of $p_{j\nu}^{(n)}$ extended over all those $\nu$ for which $E_\nu$ is also in
$C$ is unity. In the $j$th row of $P^n$ the elements corresponding to states
in $C$ add to unity, while all others vanish. In other words:

If in the matrices $P^n$ all rows and all columns corresponding to states
outside the closed set $C$ are deleted, there remain matrices for which the
fundamental relations (3.2) and (3.3) again hold. This means that we
have a Markov chain defined on $C$, and this subchain can be studied
independently of other states.
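For a finite chain the closed sets can be found mechanically. The Python sketch below rests on one observation which follows from the definition: the state $E_j$ together with all states that can be reached from it forms the smallest closed set containing $E_j$, so that the chain is irreducible exactly when this set is the whole state space for every $j$. The function names and the sample matrices are assumptions made for the illustration, and states are indexed from zero.

```python
def closure(P, j):
    """Smallest closed set containing state j: j and every state reachable from it."""
    seen = {j}
    stack = [j]
    while stack:
        v = stack.pop()
        for k, prob in enumerate(P[v]):
            if prob > 0 and k not in seen:  # one-step transition v -> k is possible
                seen.add(k)
                stack.append(k)
    return seen

def is_closed(P, C):
    """True if no one-step transition leads from a state of C to a state outside C."""
    return all(P[j][k] == 0 for j in C for k in range(len(P)) if k not in C)

def is_irreducible(P):
    """True if the only closed set is the set of all states."""
    everything = set(range(len(P)))
    return all(closure(P, j) == everything for j in everything)

# A four-state matrix with two closed sets, and a three-state cycle.
P = [
    [1 / 3, 0, 2 / 3, 0],
    [0, 1 / 4, 0, 3 / 4],
    [1 / 2, 0, 1 / 2, 0],
    [0, 1 / 2, 0, 1 / 2],
]
cycle = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
print(closure(P, 0), closure(P, 1), is_irreducible(P), is_irreducible(cycle))
```

Only the positions of the positive entries matter here; the numerical values of the transition probabilities play no role.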

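Examples. (a) In example XII we have two closed sets; they are
formed by $E_1, E_2$ and by $E_3, \ldots, E_8$, respectively.

(b) With the matrix of transition probabilities

$$P = \begin{bmatrix} 1/3 & 0 & 2/3 & 0 \\ 0 & 1/4 & 0 & 3/4 \\ 1/2 & 0 & 1/2 & 0 \\ 0 & 1/2 & 0 & 1/2 \end{bmatrix}$$

we have the two closed sets $(E_1, E_3)$ and $(E_2, E_4)$. Their matrices are

$$P_1 = \begin{bmatrix} 1/3 & 2/3 \\ 1/2 & 1/2 \end{bmatrix} \quad \text{and} \quad P_2 = \begin{bmatrix} 1/4 & 3/4 \\ 1/2 & 1/2 \end{bmatrix},$$

respectively. If we reorder the states into the sequence $E_1, E_3, E_2, E_4$,
then the matrix of the composite chain can be written in the form of a
partitioned matrix

$$P = \begin{bmatrix} P_1 & 0 \\ 0 & P_2 \end{bmatrix},$$

and it is easily verified that

$$P^n = \begin{bmatrix} P_1^n & 0 \\ 0 & P_2^n \end{bmatrix}.$$

The last statement obviously does not depend on the particular form
of $P_1$ and $P_2$.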

(c) In the random walk with absorbing barriers (example III) each
barrier is a closed set consisting of a single state. The same is true
of the states $E_0$ and $E_a$ in the sequential sampling example IX, of $E_0$
and $E_{2N}$ in the example X, taken from genetics, and of $E_1$ and $E_5$ in
the breeding example XI.

(d) In example V let $a$ be even and $q_1 = q_3 = \cdots = q_{a-1} = 0$. Then
the even states form one closed set, the odd states another.


If a single state $E_k$ forms a closed set, it will be called an absorbing state. A necessary and sufficient condition for $E_k$ to be an absorbing state is that $p_{kk} = 1$. (In this case all elements in the $k$th row of $P$ except the diagonal element vanish.) A chain in which there exist two or more closed sets is called decomposable.

5. Classification of States 

Consider an arbitrary, but fixed, state $E_j$ and suppose that at time 0 the system is in $E_j$. Let $f_j^{(n)}$ be the probability that the first return to $E_j$ occurs at time $n$. In particular,

$f_j^{(1)} = p_{jj}, \qquad f_j^{(2)} = p_{jj}^{(2)} - f_j^{(1)} p_{jj}$;

generally, the $f_j^{(n)}$ can be calculated from the obvious recurrence relation

(5.1)  $f_j^{(n)} = p_{jj}^{(n)} - f_j^{(1)} p_{jj}^{(n-1)} - f_j^{(2)} p_{jj}^{(n-2)} - \cdots - f_j^{(n-1)} p_{jj}$

which will be used only in an indirect way.
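
Although the recurrence is used only indirectly in the text, it is easy to evaluate. The Python sketch below reads (5.1) as: the first-return probability at time $n$ equals the $n$-step return probability minus the sum, over $m < n$, of the first-return probability at time $m$ times the $(n-m)$-step return probability. The two-state matrix is a hypothetical illustration of ours, not one of the numbered examples, and the function names are assumptions.

```python
from fractions import Fraction as F

def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def first_return_probs(p, j, n_max):
    """Return the first-return probabilities to state j at times 1..n_max."""
    # diag[n] holds the n-step return probability; diag[0] is padding.
    diag = [None]
    power = p
    for _ in range(n_max):
        diag.append(power[j][j])
        power = matmul(power, p)
    f = [None]  # f[n] is the first-return probability at time n
    for n in range(1, n_max + 1):
        f.append(diag[n] - sum(f[m] * diag[n - m] for m in range(1, n)))
    return f[1:]

# Hypothetical two-state chain used only for illustration.
P = [[F(1, 2), F(1, 2)],
     [F(1, 4), F(3, 4)]]

f0 = first_return_probs(P, 0, 12)
partial_sum = sum(f0)  # approaches the probability of ever returning
```

For this chain the first return to state 0 at time $n \ge 2$ requires one step away, $n - 2$ steps spent in state 1, and one step back, which gives an independent check of the computed values.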

The sum

(5.2)  $f_j = \sum_{n=1}^{\infty} f_j^{(n)}$

may be interpreted as the probability that the system ever returns to $E_j$. If $f_j = 0$, a return is impossible, while $f_j = 1$ is interpreted as certainty of return. Once the system is back at $E_j$, the initial situation is re-established, and the process starts from the beginning as a replica of the preceding trials. Hence the return to the state $E_j$ is a recurrent event as defined in Chapter 12. If $f_j = 1$, this recurrent event is certain, and

(5.3)  $\mu_j = \sum_{n=1}^{\infty} n f_j^{(n)}$

is the mean recurrence time. [Note that equation (5.1) is, except for notation, identical with equation (3.1) of Chapter 12.]

From the theory of recurrent events we have the following double 

Classification of States. (1) The state $E_j$ is called recurrent or transient according to whether a return to $E_j$ is certain or uncertain (that is, according to whether $f_j = 1$ or $f_j < 1$). A recurrent state $E_j$ with infinite mean recurrence time is called a null state.

(2) The state $E_j$ is called periodic with period $t$ if a return to $E_j$ is impossible except, perhaps, in $t, 2t, 3t, \ldots$ steps, and $t > 1$ is the greatest integer with this property. (In this case $p_{jj}^{(n)} = 0$ whenever $n$ is not divisible by $t$.)

A recurrent state which is neither a null state nor periodic will be called ergodic.
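
For a finite chain the period of a state can be found mechanically as the greatest common divisor of the step counts at which a return has positive probability. The Python sketch below is an illustration only: the function names are ours, the search horizon `n_max` is a heuristic assumption rather than a proven bound, and the test chain is a random walk on five states with absorbing end states, assuming (as an illustration) steps up with probability $p$ and down with probability $q$. A result of 1 corresponds to what the text calls non-periodic.

```python
from math import gcd

def period(p, j, n_max=None):
    """gcd of all n <= n_max at which a return to state j is possible.

    Returns 0 if no return is possible within the horizon.
    """
    size = len(p)
    if n_max is None:
        n_max = 2 * size * size  # heuristic horizon (an assumption)
    # reach[k] tells whether state k can be occupied after the current step.
    reach = [k == j for k in range(size)]
    g = 0
    for n in range(1, n_max + 1):
        reach = [any(reach[i] and p[i][k] > 0 for i in range(size))
                 for k in range(size)]
        if reach[j]:
            g = gcd(g, n)
    return g

def absorbing_walk(a, p):
    """Matrix of a walk on states 0..a whose end states are absorbing."""
    q = 1 - p
    P = [[0.0] * (a + 1) for _ in range(a + 1)]
    P[0][0] = P[a][a] = 1.0
    for k in range(1, a):
        P[k][k - 1], P[k][k + 1] = q, p
    return P

P = absorbing_walk(4, 0.5)
interior_period = period(P, 2)  # an interior state
barrier_period = period(P, 0)   # an absorbing barrier
```

Only the pattern of positive entries matters here, so the numerical value of `p` does not affect the result as long as both `p` and `1 - p` are positive.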

Examples. In examples I and II all states are recurrent. A random walk with absorbing barriers necessarily ends after finitely many steps at one of the barriers; hence in example III the states $E_j$ with $1 \le j \le a - 1$ are transient. Moreover, they have period 2 since a return can obviously occur only after an even number of steps. The two limiting states $E_0$ and $E_a$ are recurrent and non-periodic, since for them a return after one step has probability one. In the case of reflecting barriers (example IV) it is intuitively clear that all states are recurrent. However, they are non-periodic, since the system can stay at $E_1$ for an arbitrary time and then return to $E_j$. From Chapter 14, section 7, we know that in an unrestricted symmetric random walk in one or two dimensions all states are recurrent null states, whereas in three dimensions all states are transient. In either case all states have period 2.


From the fundamental theorem of Chapter 12, we get directly the

Criterion. For a transient state $E_j$ the series

(5.4)  $\sum_{n=1}^{\infty} p_{jj}^{(n)}$

converges. For a recurrent null state $E_j$ this series diverges, but $p_{jj}^{(n)} \to 0$ as $n \to \infty$. If $E_j$ is ergodic (that is, recurrent, but neither null state nor periodic), then $\mu_j < \infty$ and

(5.5)  $p_{jj}^{(n)} \to \frac{1}{\mu_j}$.

If $E_j$ has period $t$ and is a recurrent non-null state, then $\mu_j < \infty$ and

(5.6)  $p_{jj}^{(nt)} \to \frac{t}{\mu_j}$

(while, of course, $p_{jj}^{(n)} = 0$ for all $n$ not divisible by $t$).





The salient fact revealed by this criterion is that, except in the periodic case, $p_{jj}^{(n)}$ has a unique limit; the latter is zero if $E_j$ is a null state or transient, and otherwise is given by (5.5). In the periodic case a limit exists for the subsequence $n = t, 2t, 3t, \ldots$.

Now let $E_j$ be a fixed recurrent state and $E_k$ some other state which can be reached from it. Furthermore, let $N$ be the length of the shortest possible path from $E_j$ to $E_k$, so that $p_{jk}^{(N)} = \alpha > 0$. A return from $E_k$ to $E_j$ must have positive probability, for otherwise the probability of the system not returning to $E_j$ would be at least $\alpha$, and $f_j \le 1 - \alpha < 1$ contrary to the assumption that $E_j$ is recurrent. It follows that there exists an index $M$ such that $p_{kj}^{(M)} = \beta > 0$. Now for any $n$ we have obviously

(5.7)  $p_{jj}^{(n+N+M)} \ge p_{jk}^{(N)} p_{kk}^{(n)} p_{kj}^{(M)} = \alpha\beta\, p_{kk}^{(n)}$

and

(5.8)  $p_{kk}^{(n+N+M)} \ge p_{kj}^{(M)} p_{jj}^{(n)} p_{jk}^{(N)} = \alpha\beta\, p_{jj}^{(n)}$.

These relations imply that the sequences $p_{jj}^{(n)}$ and $p_{kk}^{(n)}$ have the same asymptotic behavior, and from this we can draw important conclusions. To begin with, $E_j$ was assumed recurrent, and therefore the series $\sum p_{jj}^{(n)}$ diverges. From (5.8) it follows that also $\sum p_{kk}^{(n)}$ diverges, so that $E_k$ must be recurrent. If $p_{jj}^{(n)} \to 0$, then also $p_{kk}^{(n)} \to 0$, and vice versa. Finally, suppose that $E_j$ has period $t$. A return to $E_j$ is possible in $N + M$ steps, so that $N + M$ must be a multiple of $t$. It follows then from (5.7) and (5.8) that $E_j$ and $E_k$ must have the same period.

We see thus that from a recurrent state only recurrent states can be reached, and they are all of the same type: either they are all null states, or all ergodic, or all periodic non-null states with the same period.
Now the set of all states that can be reached from Ej is obviously closed 
and is therefore the smallest closed set containing Ej. It follows that 
in an irreducible chain every state can be reached from every other 
state, and hence if one state is transient so are necessarily all others. 
We have thus proved the important 

Theorem. In an irreducible Markov chain all states belong to the same class: they are all transient, all recurrent null states, or all recurrent non-null states. In every case they have the same period. Moreover, every state can be reached from every other state.

In every chain the recurrent states can, in a unique manner, be divided into closed sets $C_1, C_2, \ldots$ such that from any state of a given set all states of that set and no other can be reached. Since each $C_\nu$ can be treated independently as a Markov chain, all states belonging to the same closed set $C_\nu$ are necessarily of the same class.

In addition to the closed sets $C_\nu$ the chain will in general contain transient states from which states of the closed sets $C_\nu$ can be reached (but not vice versa).

Examples, (a) In the one-dimensional symmetric random walk 
(coin tossing) all states are recurrent null states of period 2; if the 
random walk is unsymmetric, all states are transient (cf. Chapter 12, 
section 3). In the random walk with absorbing barriers (example III) 
E 0 forms one closed set, and E a another closed set. All other states 
are transient. From each transient state all other states can be reached. 

(b) Consider the chain with states $E_1, \ldots, E_6$ and matrix

$$P = \begin{pmatrix}
1/2 & 1/2 & 0 & 0 & 0 & 0 \\
1/2 & 1/2 & 0 & 0 & 0 & 0 \\
0 & 0 & 1/3 & 2/3 & 0 & 0 \\
0 & 0 & 2/3 & 1/3 & 0 & 0 \\
1/6 & 1/6 & 1/6 & 1/6 & 1/6 & 1/6 \\
1/6 & 1/6 & 1/6 & 1/6 & 1/6 & 1/6
\end{pmatrix}$$

Here we have two closed sets; $C_1$ consists of $E_1$ and $E_2$, while $C_2$ is formed by $E_3$ and $E_4$. From $E_5$ and $E_6$ the system can pass either to the closed set $C_1$ or to $C_2$ and then no return is possible. Hence $E_5$ and $E_6$ are transient. Clearly each of the sets $C_1$ and $C_2$ can be studied in itself as a complete Markov chain. The transient states connect $C_1$ and $C_2$ inasmuch as either closed set can be reached from them. The situation is analogous to the case of a random walk with absorbing barriers, where the two closed sets contained only one state each, but there were many more transient states.
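
The decomposition described in this example can be found mechanically for any finite chain. The Python sketch below is a minimal illustration under stated assumptions: it uses the fact, valid for finite chains, that a state is recurrent exactly when every state reachable from it leads back to it; the function names `reachable` and `classify` are ours; and indices 0 to 5 stand for the six states of the example.

```python
from fractions import Fraction as F

def reachable(p, j):
    """Set of states that can be reached from state j in one or more steps."""
    seen, stack = set(), [j]
    while stack:
        i = stack.pop()
        for k, prob in enumerate(p[i]):
            if prob > 0 and k not in seen:
                seen.add(k)
                stack.append(k)
    return seen

def classify(p):
    """Split a finite chain into closed recurrent sets and transient states."""
    reach = [reachable(p, j) for j in range(len(p))]
    closed_sets, transient = [], []
    for j in range(len(p)):
        # Recurrent (finite chain): every reachable state leads back to j.
        if all(j in reach[k] for k in reach[j]):
            if reach[j] not in closed_sets:
                closed_sets.append(reach[j])
        else:
            transient.append(j)
    return closed_sets, transient

h, t, s = F(1, 2), F(1, 3), F(1, 6)
P = [[h, h, 0, 0, 0, 0],
     [h, h, 0, 0, 0, 0],
     [0, 0, t, 2 * t, 0, 0],
     [0, 0, 2 * t, t, 0, 0],
     [s, s, s, s, s, s],
     [s, s, s, s, s, s]]

closed_sets, transient = classify(P)
```

For a recurrent state the set returned by `reachable` is the smallest closed set containing it, which is why equal sets are merged into a single entry.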

In general, the matrix $P$ corresponding to a chain with two closed sets $C_1$ and $C_2$ and additional transient states can be written schematically in the form of a partitioned matrix

(5.9)  $P = \begin{pmatrix} P_1 & 0 & 0 \\ 0 & P_2 & 0 \\ A & B & C \end{pmatrix}$

where $P_1$ and $P_2$ are the matrices of transition probabilities within the two closed sets. The matrix $P^n$ is then of the same type with $P_1$, $P_2$, $C$ replaced by $P_1^n$, $P_2^n$, $C^n$ (and $A$ and $B$ by more complicated matrices to be studied in section 8). Note that $P_1$, $P_2$, and $C$ are square matrices, but that $A$ and $B$ may be rectangular matrices. Thus, in example III, the two corner elements $p_{00}$ and $p_{aa}$ represent $P_1$ and $P_2$. The matrix $C$ is the $a - 1$ by $a - 1$ matrix obtained by deleting the first and last rows and columns. Finally, $A$ and $B$ are single-column matrices with elements $(q, 0, 0, \ldots, 0)$ and $(0, 0, \ldots, 0, p)$, respectively.

It will help to clarify ideas if we mention here that theorem 1 of the 
next section has the following 

Corollary. A finite chain can contain no null states, and it is impossible that all its states are transient.

6. Ergodic Properties of Aperiodic Chains; Stationary Distributions 

In the preceding section we have described the asymptotic behavior of the diagonal terms $p_{jj}^{(n)}$. These results will now be used for a discussion of the behavior of $p_{jk}^{(n)}$ for an arbitrary pair $j$, $k$. In this section we consider mainly aperiodic and irreducible chains. For them we shall establish the fact stated at the end of section 3, namely, that the probability of finding the system at time $n$ in state $E_k$ is in the limit independent of the initial state.

Let $f_{jk}^{(n)}$ be the probability that, starting from $E_j$, the system reaches $E_k$ for the first time at the $n$th step (first passage through $E_k$ starting from $E_j$). We have clearly

(6.1)  $p_{jk}^{(n)} = f_{jk}^{(n)} + f_{jk}^{(n-1)} p_{kk} + f_{jk}^{(n-2)} p_{kk}^{(2)} + \cdots + f_{jk}^{(1)} p_{kk}^{(n-1)}$.

This relation is a direct generalization of (5.1) and enables us to calculate recursively the $f_{jk}^{(n)}$ in terms of the $p_{jk}^{(n)}$ or vice versa.

Theorem 1. If the state $E_k$ is either transient or a recurrent null state, then $p_{jk}^{(n)} \to 0$ for every $j$.

(Note that here periodic chains are not excluded.)

Proof. The criterion of section 5 assures us that $p_{kk}^{(n)} \to 0$. Hence for every fixed $N$ the last $N$ terms in (6.1) tend to zero. The first $n - N$ terms add to at most $f_{jk}^{(n)} + f_{jk}^{(n-1)} + \cdots + f_{jk}^{(N+1)}$; their sum is therefore less than the $N$th remainder of a convergent series and can be made arbitrarily small by choosing $N$ large enough.

This proves the theorem. If the chain contains only a finite number, $a$, of states, then for at least one $k$ we must have $p_{jk}^{(n)} \ge 1/a$. It is then impossible that all $p_{jk}^{(n)}$ should tend to zero, and this proves the corollary stated at the end of section 5.





Theorem 2. Suppose that the states of an irreducible chain are aperiodic and neither transient nor null states. Then for every pair $j$, $k$ the limit

(6.2)  $\lim_{n \to \infty} p_{jk}^{(n)} = u_k > 0$

exists and is independent of $j$. The reciprocal of $u_k$ is the mean recurrence time $\mu_k$ of $E_k$. Moreover, $\{u_k\}$ is a probability distribution with positive elements, that is,

(6.3)  $u_k > 0, \qquad \sum u_k = 1$.

Finally, the $u_k$ satisfy the system of linear equations:

(6.4)  $u_k = \sum_{\nu} u_\nu p_{\nu k}$.

The distribution $\{u_k\}$ is uniquely determined by (6.4) and (6.3) or, more precisely, if $\{v_k\}$ is any other sequence satisfying the conditions

(6.5)  $v_k = \sum_{\nu} v_\nu p_{\nu k}, \qquad \sum |v_k| < \infty$,

then $u_k = c v_k$ with a constant $c$.

Proof. By assumption each state $E_k$ has a finite mean recurrence time $\mu_k$. Put $u_k = 1/\mu_k$. We know from (5.5) that $p_{kk}^{(n)} \to u_k$, so that (6.2) holds when $j = k$. Now for any fixed $j$ we have $\sum_n f_{jk}^{(n)} = 1$, for otherwise the system would have a positive probability of passing from $E_k$ to $E_j$ and never to return, which contradicts the hypothesis that $E_k$ is recurrent. Hence we can to every $\epsilon > 0$ select an $N$ so that $f_{jk}^{(1)} + \cdots + f_{jk}^{(N)} > 1 - \epsilon$. The last $N$ terms on the right side in (6.1) differ arbitrarily little from $u_k \{ f_{jk}^{(1)} + \cdots + f_{jk}^{(N)} \}$, and hence from $u_k$; the sum of the first $n - N$ terms is less than the $N$th remainder in the series $\sum f_{jk}^{(n)}$, and hence less than $\epsilon$. This proves (6.2).

To prove (6.4) we first note that

(6.6)  $\sum u_k \le 1$.

This follows directly from the fact that for fixed $j$ and $n$ the quantities $p_{jk}^{(n)}$ ($k = 1, 2, \ldots$) add to unity, so that $u_1 + u_2 + \cdots + u_N \le 1$ for every $N$. Now put $n = 1$ in (3.3) and let $m \to \infty$. The left side tends to $u_k$, and the general term of the sum on the right side tends to $u_\nu p_{\nu k}$. Adding an arbitrary finite number of terms, we see that

(6.7)  $u_k \ge \sum_{\nu} u_\nu p_{\nu k}$.

Summing these inequalities over all $k$, we obtain the finite quantity $\sum u_k$ on each side. This shows that in (6.7) the inequality is impossible, and thus (6.4) is proved.

If the sequence $\{v_k\}$ satisfies (6.5), then we may multiply the equation in (6.5) by $p_{kr}$ and sum over all $k$. We get

(6.8)  $v_r = \sum_{\nu} v_\nu p_{\nu r}^{(2)}$.

Repeating the same operation, it follows that for every $n$

(6.9)  $v_r = \sum_{\nu} v_\nu p_{\nu r}^{(n)}$.

Now we have assumed that the series of the coefficients $v_\nu$ converges absolutely. We can, therefore, in (6.9) let $n \to \infty$ and obtain in the limit

(6.10)  $v_r = (v_1 + v_2 + v_3 + \cdots) u_r$.

The sum in the parentheses is a constant independent of $r$, and this proves that the ratio $v_r / u_r$ is constant. Finally, putting $v_k = u_k$, we find that $\sum u_k = 1$. This accomplishes the proof.

Examples. (a) In example II, the system (6.4) reduces to $u_0 = (u_0 + u_1 + \cdots) q = q$, and $u_k = u_{k-1} p$ for $k \ge 1$. The solution is obviously $u_k = p^k q$, in agreement with the fact found in example (3.5) that $p_{jk}^{(n)} \to p^k q$.

(b) Random Walk with Reflecting Barriers. In example IV the system (6.4) reduces to

(6.11)
$q u_1 + q u_2 = u_1$
$p u_{k-1} + q u_{k+1} = u_k \qquad (k = 2, 3, \ldots, a - 1)$
$p u_{a-1} + p u_a = u_a$

Then $u_k = u_{k-1} (p/q)$ and hence $u_k = u_1 (p/q)^{k-1}$. The value of $u_1$ follows from the condition $\sum u_k = 1$. The final result is

(6.12)  $u_k = \frac{1 - p/q}{1 - (p/q)^a} \, (p/q)^{k-1}$

if $p \ne q$, and $u_k = 1/a$ if $p = q = 1/2$. Wherever the system starts at time 0, the probability of finding the system at time $n$ in state $E_k$ is, in the limit, given by $u_k$. If $p = q$, all states become equally likely, while in the case $p < q$ the states near the left barrier are more probable.
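
The result for reflecting barriers can be checked numerically. The Python sketch below is an illustration under assumptions of ours: the transition matrix is rebuilt from the balance equations of this example (so the walk stays at the left barrier with probability $q$ and at the right barrier with probability $p$), the candidate distribution is a geometric sequence with ratio $p/q$ normalized to sum to one, and all function names are hypothetical. It tests the stationarity equations exactly and then compares the rows of a high power of the matrix with the candidate limits.

```python
from fractions import Fraction as F

def reflecting_walk(a, p):
    """Matrix for a states (indices 0..a-1) with reflecting barriers."""
    q = 1 - p
    P = [[F(0)] * a for _ in range(a)]
    P[0][0], P[0][1] = q, p                  # left barrier holds with prob. q
    P[a - 1][a - 2], P[a - 1][a - 1] = q, p  # right barrier holds with prob. p
    for k in range(1, a - 1):
        P[k][k - 1], P[k][k + 1] = q, p
    return P

def limiting_distribution(a, p):
    """Geometric weights with ratio p/q, normalized; uniform if p equals q."""
    q = 1 - p
    if p == q:
        return [F(1, a)] * a
    r = p / q
    return [(1 - r) / (1 - r ** a) * r ** (k - 1) for k in range(1, a + 1)]

def matmul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

a, p = 5, F(1, 3)
P = reflecting_walk(a, p)
u = limiting_distribution(a, p)

# Stationarity: the row vector u is reproduced by one step of the chain.
uP = [sum(u[v] * P[v][k] for v in range(a)) for k in range(a)]

# Eight successive squarings give the 256th power of P.
Pn = P
for _ in range(8):
    Pn = matmul(Pn, Pn)
gap = max(abs(float(Pn[j][k] - u[k])) for j in range(a) for k in range(a))
```

The quantity `gap` measures how far any row of the high power is from the candidate limits; its smallness for every row illustrates that the limit does not depend on the initial state.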





(c) The Ehrenfest Model. In example VII the equations (6.4) take on the form

(6.13)
$u_k = \left(1 - \frac{k-1}{a}\right) u_{k-1} + \frac{k+1}{a} \, u_{k+1} \qquad (k = 1, \ldots, a - 1)$
$u_0 = \frac{u_1}{a}, \qquad u_a = \frac{u_{a-1}}{a}$.

It is easily verified that the required solution is

(6.14)  $u_k = \binom{a}{k} 2^{-a}$.

This is a binomial distribution, and the result can be interpreted as 
follows: whatever the initial number of molecules in the first container, 
after a long time the probability of finding exactly k molecules in it is 
the same as if the a molecules had been distributed at random, each 
molecule having probability 1/2 to be in the first container. This is 
a typical example of how our result gains physical significance. 
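
The verification called easy above can be delegated to a few lines of Python. The sketch assumes the usual transition rule of the Ehrenfest model, which is consistent with the equations of this example: from the state with $k$ molecules in the first container the system moves to $k - 1$ with probability $k/a$ and to $k + 1$ with probability $1 - k/a$. The function names are ours, and the check covers stationarity only.

```python
from fractions import Fraction as F
from math import comb

def ehrenfest(a):
    """Transition matrix on states 0..a (molecules in the first container)."""
    P = [[F(0)] * (a + 1) for _ in range(a + 1)]
    for k in range(a + 1):
        if k > 0:
            P[k][k - 1] = F(k, a)      # a molecule leaves the first container
        if k < a:
            P[k][k + 1] = F(a - k, a)  # a molecule enters the first container
    return P

def binomial_half(a):
    """Binomial distribution with a trials and success probability 1/2."""
    return [F(comb(a, k), 2 ** a) for k in range(a + 1)]

a = 6
P = ehrenfest(a)
u = binomial_half(a)

# One step of the chain applied to the row vector u.
uP = [sum(u[v] * P[v][k] for v in range(a + 1)) for k in range(a + 1)]
is_stationary = (uP == u)
```

The value `a = 6` is arbitrary; the same check can be repeated for other values of `a`.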

The normal approximation to the binomial distribution shows that, if $a$ is large, then, once the limiting distribution (6.14) is established, we are practically certain to find about one-half of the molecules in each container. To the physicist $a = 10^6$ is a small number, indeed. But even with $a = 10^6$ molecules the probability of finding more than 505,000 molecules in one container (density fluctuation of about 1 per cent) is of the order of magnitude $10^{-23}$. With $a = 10^8$ a density fluctuation of one in a thousand has the same negligible probability.
It is true that the system will occasionally pass into very improbable 
states, but their recurrence times are fantastically large as compared 
with the recurrence times of states near the equilibrium. Physical 
irreversibility manifests itself in the fact that, whenever the system 
is in a state far removed from equilibrium, it is much more likely to 
move towards equilibrium than in the opposite direction. 

(d) Doubly Stochastic Matrices. The matrix $P$ is called doubly stochastic if not only the row sums, but also the column sums, are unity. Suppose that the chain contains only a finite number, $a$, of states. The system (6.4) has then obviously the solution $u_k = 1/a$. It follows that, if a finite irreducible aperiodic chain has a doubly stochastic matrix $P$, then $u_k = 1/a$ (that is, in the limit all states become equally probable). It is easily seen that the condition that $P$ be doubly stochastic is not only sufficient but also necessary. Clearly $P^n$ is again a doubly stochastic matrix and, hence, if the finite matrix $P$ is doubly stochastic, no state can be transient (cf. problem 9).
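
The step described as obvious can be written out in one line. Assuming only that every column of $P$ sums to unity, substituting the constant value $1/a$ for each unknown on the right side of (6.4) gives

```latex
\sum_{\nu=1}^{a} \frac{1}{a}\, p_{\nu k}
  = \frac{1}{a} \sum_{\nu=1}^{a} p_{\nu k}
  = \frac{1}{a},
\qquad
\sum_{k=1}^{a} \frac{1}{a} = 1 ,
```

so that the constant sequence satisfies both (6.4) and the normalization (6.3).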

It should be remembered that our theorems apply also to reducible chains, since each closed set can be treated separately. Suppose that $E_j$ is a recurrent state of an aperiodic irreducible subchain $C$. The behavior of $p_{jk}^{(n)}$ for all states $E_k$ in $C$ is described in theorems 1 and 2. If $E_k$ is outside $C$, then $p_{jk}^{(n)} = 0$ for all $n$. We lack only information concerning $p_{jk}^{(n)}$ in the periodic case, and if $E_j$ is transient and $E_k$ recurrent. The periodic case will be dealt with in the next section, the transient case in section 8.

We can reformulate our theorems in terms of the absolute probabilities $\{a_k^{(n)}\}$ introduced at the end of section 3. It follows from (3.4) that (6.2) implies

(6.15)  $a_k^{(n)} \to u_k$.

Our theorems admit the following

Corollary I. In every aperiodic irreducible chain the probability $a_k^{(n)}$ of finding the system at time $n$ in state $E_k$ tends to a uniquely determined limit which is independent of the initial distribution. If all states are transient or null states, $a_k^{(n)} \to 0$ for all $k$. Otherwise the probability of $E_k$ is, in the limit, the reciprocal of the (finite) mean recurrence time of $E_k$.

Stationary Distributions. The initial probability distribution $\{a_k\}$ is called stationary if the probabilities $\{a_k^{(n)}\}$ are independent of $n$, that is, if $a_k^{(n)} = a_k$. The physical significance of stationarity becomes apparent if we imagine a large number of processes going on simultaneously. Let, for example, $N$ particles perform independently the same type of random walk. At time $n$ the expected number of particles in state $E_k$ is $N a_k^{(n)}$. With a stationary distribution these expected numbers remain constant, and we observe (if $N$ is large so that the law of large numbers applies) a state of macroscopic equilibrium maintained by a large number of transitions in opposite directions. Most statistical equilibria in physics are of this kind.

Corollary I asserts that $a_k^{(n)}$ has a limit as $n \to \infty$; for a stationary distribution $a_k$ must coincide with this limit. On the other hand, under the conditions of theorem 2 the limits $\{u_k\}$ form a probability distribution, and (6.4) shows that this distribution is stationary. Hence it is the unique stationary distribution, and we have

Corollary II. If all states of an aperiodic irreducible chain are transient or null states, there exists no stationary distribution; otherwise there exists a unique stationary distribution $\{u_k\}$ and the probability distribution $\{a_k^{(n)}\}$ necessarily converges towards it. (For finite chains only the second alternative is possible.)

We have seen that in physics the convergence of $a_k^{(n)}$ to $u_k$ may be interpreted as tendency towards a state of equilibrium. A typical example where no state of equilibrium exists is the one-dimensional unrestricted random walk. If $p \ne q$, there exists a drift, and all states are transient; whatever the number of particles, after a long time they will have drifted away towards infinity. If $p = q = 1/2$, all states are recurrent, but the tendency towards equilibrium requires all states to become, in the limit, equally probable, so that the probability of each state tends to zero.

* 7. Periodic Chains 

In the preceding section we have excluded the case of periodic chains, 
but this was done only in order not to obscure salient facts by com¬ 
plicated descriptions. A characterization of the asymptotic behavior 
of pjk (n) in irreducible periodic chains can be easily derived from 
the theorems of the preceding sections. We give such a derivation for 
the sake of completeness, but the results of this section will not be 
used in the sequel. 

By the theorem of section 5 all states of an irreducible chain have the same period $t$. Consider now any two states $E_j$ and $E_k$ of an irreducible chain with period $t$. Since every state can be reached from every other, there exist integers $a$, $b$ such that $p_{jk}^{(a)} > 0$ and $p_{kj}^{(b)} > 0$. Now $p_{jj}^{(a+b)} \ge p_{jk}^{(a)} p_{kj}^{(b)}$, which shows that a return to $E_j$ in $a + b$ steps is possible, so that $a + b$ is necessarily divisible by the period $t$. It follows that, if $E_k$ can be reached from $E_j$ in $a_1$ and in $a_2$ steps, then $a_2 - a_1$ must be divisible by $t$ (since both $a_1 + b$ and $a_2 + b$ are), and hence a division of $a_1$ and $a_2$ by $t$ will leave the same remainder.

Accordingly, for each fixed state $E_j$ there corresponds to every state $E_k$ a certain remainder $\nu$ (with $\nu = 0, 1, \ldots, t - 1$), so that a transition from $E_j$ to $E_k$ is possible only in $\nu, \nu + t, \nu + 2t, \nu + 3t, \ldots$ steps. We put $j = 1$ and have then a classification of all states into $t$ groups $G_0, G_1, \ldots, G_{t-1}$, in the following way. If $p_{1k}^{(a)} > 0$ and $a = nt + \nu$ (where $\nu$ is the remainder so that $0 \le \nu < t$), then $E_k$ belongs to $G_\nu$.

* Starred sections treat special topics and may be omitted at first reading. 




We imagine the $G_\nu$ ordered cyclically so that $G_0$ and $G_{t-1}$ become neighbors.

It follows in particular that a one-step transition from a state in $G_\nu$ will always lead to a state in the next following group $G_{\nu+1}$ (or $G_0$ in case $\nu = t - 1$); a two-step transition will lead to a state in $G_{\nu+2}$ (from $G_{t-2}$ it leads to $G_0$, from $G_{t-1}$ to $G_1$), etc. Finally, a $t$-step transition leads necessarily to a state belonging to the same group. This means that, in a Markov chain whose matrix of transition probabilities is $P^t$, each group $G_\nu$ forms a closed set. Since the original chain is irreducible, each state can be reached from every other. This implies that in the chain with transition probabilities $P^t$ each $G_\nu$ forms an irreducible closed set. We have thus the

Theorem. In an irreducible periodic Markov chain all states can be divided into $t$ groups $G_0, \ldots, G_{t-1}$, so that a one-step transition from a state of $G_\nu$ always leads to a state of $G_{\nu+1}$ (to $G_0$ if $\nu = t - 1$). If we consider the chain only at times $t, 2t, 3t, \ldots$, then we get a new chain whose matrix of transition probabilities is $P^t$. In it each $G_\nu$ forms an irreducible closed set.


Examples. (a) In an unrestricted random walk all states have period 2. In one dimension the group $G_0$ is formed by all even positions, $G_1$ by all odd ones. In more dimensions the same statement is true if a position is called even or odd according to whether the sum of its coordinates is even or odd.

(b) In example XIII the states $E_1$ and $E_2$ form $G_0$, and the remaining six states $G_1$.

(c) Consider six states $E_1, \ldots, E_6$ with the matrix

(7.1)  $P = \begin{pmatrix}
0 & 1/2 & 1/2 & 0 & 0 & 0 \\
0 & 0 & 0 & 1/3 & 1/3 & 1/3 \\
0 & 0 & 0 & 1/3 & 1/3 & 1/3 \\
1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}$

From $E_1$ the system necessarily passes to $E_2$ or $E_3$. From $E_2$ and $E_3$ transitions are possible only to the states $E_4$, $E_5$, $E_6$, and from any of these the system necessarily passes to $E_1$. Consequently, the chain has period 3. The group $G_0$ consists of only $E_1$. The group $G_1$ is formed by $E_2$ and $E_3$, the group $G_2$ by $E_4$, $E_5$, $E_6$. We have

$$P^2 = \begin{pmatrix}
0 & 0 & 0 & 1/3 & 1/3 & 1/3 \\
1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1/2 & 1/2 & 0 & 0 & 0 \\
0 & 1/2 & 1/2 & 0 & 0 & 0 \\
0 & 1/2 & 1/2 & 0 & 0 & 0
\end{pmatrix}, \qquad
P^3 = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1/2 & 1/2 & 0 & 0 & 0 \\
0 & 1/2 & 1/2 & 0 & 0 & 0 \\
0 & 0 & 0 & 1/3 & 1/3 & 1/3 \\
0 & 0 & 0 & 1/3 & 1/3 & 1/3 \\
0 & 0 & 0 & 1/3 & 1/3 & 1/3
\end{pmatrix}$$

and then periodically $P^4 = P$, $P^5 = P^2$, etc.
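
The periodic pattern of the powers and the division into groups can both be checked by direct computation. In the Python sketch below the indices 0 to 5 stand for the six states, and the helper `cyclic_groups` is a hypothetical name of ours: it assigns to each state the number of steps, reduced modulo the period, needed to reach it from a chosen starting state, and it assumes that the chain is irreducible and that the period supplied is correct.

```python
from fractions import Fraction as F

def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def cyclic_groups(p, t, start=0):
    """Group states by path length modulo t from the starting state."""
    group = {start: 0}
    queue = [start]
    while queue:
        i = queue.pop(0)
        for k, prob in enumerate(p[i]):
            if prob > 0 and k not in group:
                group[k] = (group[i] + 1) % t
                queue.append(k)
    return [sorted(k for k in group if group[k] == v) for v in range(t)]

half, third = F(1, 2), F(1, 3)
P = [[0, half, half, 0, 0, 0],
     [0, 0, 0, third, third, third],
     [0, 0, 0, third, third, third],
     [1, 0, 0, 0, 0, 0],
     [1, 0, 0, 0, 0, 0],
     [1, 0, 0, 0, 0, 0]]

P2 = matmul(P, P)
P3 = matmul(P2, P)
P4 = matmul(P3, P)
P5 = matmul(P4, P)

groups = cyclic_groups(P, 3)
# In the third power no probability passes between different groups.
blocks_closed = all(P3[i][k] == 0
                    for g in groups for i in g
                    for k in range(6) if k not in g)
```

The last check expresses the statement that each group forms a closed set for the chain observed only at multiples of the period.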


Our theorem contains complete information concerning the asymptotic behavior of $p_{jk}^{(n)}$. If all states are transient or null states, then $p_{jk}^{(n)} \to 0$ for every pair $j$, $k$ (theorem 1 of section 6). Otherwise each state $E_k$ has a finite mean recurrence time $\mu_k$. Suppose that $E_j$ belongs to $G_\nu$. On $G_\nu$ we have an irreducible non-periodic Markov chain with transition probabilities $p_{jk}^{(t)}$, and hence (by theorem 2 of section 6) there exist the limits

(7.2)  $\lim_{n \to \infty} p_{jk}^{(nt)} = \begin{cases} u_k & \text{if } E_k \text{ is in } G_\nu \\ 0 & \text{otherwise.} \end{cases}$

Here $u_k$ is the reciprocal mean recurrence time of $E_k$ in the new chain, one step of which corresponds to $t$ steps of the original chain. Hence

(7.3)  $u_k = \frac{t}{\mu_k}$.

Using (3.2), we find from (7.2),

(7.4)  $\lim_{n \to \infty} p_{jk}^{(nt+1)} = \begin{cases} u_k & \text{if } E_k \text{ is in } G_{\nu+1} \\ 0 & \text{otherwise.} \end{cases}$

Similarly, $p_{jk}^{(nt+2)} \to u_k$ if $E_k$ is in $G_{\nu+2}$, etc. In other words, for fixed $E_j$ and $E_k$ the sequence $p_{jk}^{(n)}$ is asymptotically periodic; in it blocks of $t - 1$ consecutive zeros alternate with a positive element which converges to $u_k = t/\mu_k$.

By theorem 2 of section 6, the $u_k$ within each group $G_\nu$ add to unity. Since there are $t$ blocks, it follows from (7.3) that the sequence $\{1/\mu_k\}$ represents a probability distribution. The argument of section 6 shows directly that this distribution is stationary and that no other stationary distributions exist.
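
For the six-state chain of period 3 in example (c) of this section the last statement can be verified directly. In the Python sketch below the limits within each group are read off the rows of the third power of the matrix (1 for the single state of the first group, 1/2 for each state of the second, 1/3 for each state of the third); dividing them by the period gives the reciprocal mean recurrence times. The variable names are ours.

```python
from fractions import Fraction as F

half, third = F(1, 2), F(1, 3)
P = [[0, half, half, 0, 0, 0],
     [0, 0, 0, third, third, third],
     [0, 0, 0, third, third, third],
     [1, 0, 0, 0, 0, 0],
     [1, 0, 0, 0, 0, 0],
     [1, 0, 0, 0, 0, 0]]

t = 3
# Limits along multiples of the period, one value per state.
u = [F(1), half, half, third, third, third]
# Reciprocal mean recurrence times: each limit divided by the period.
w = [x / t for x in u]

# One step of the chain applied to the row vector w.
wP = [sum(w[v] * P[v][k] for v in range(6)) for k in range(6)]
mean_recurrence_times = [1 / x for x in w]
```

Here `mean_recurrence_times` lists the mean recurrence times of the six states in the original chain, measured in single steps.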





8. Transient States; Absorption Probabilities 

In the two preceding sections we have completely described the asymptotic behavior of $p_{jk}^{(n)}$ for the case where $E_j$ is recurrent. If $E_k$ is transient, then $p_{jk}^{(n)} \to 0$ for all $j$ (theorem 1, section 6). It remains to investigate the case where $E_j$ is transient and $E_k$ recurrent. This is a direct generalization of the classical ruin problem or random walk with two absorbing barriers (example III). In that particular case the two absorbing states $E_0$ and $E_a$ are the only recurrent states, and $p_{j0}^{(n)}$ and $p_{ja}^{(n)}$ are, respectively, the probabilities that the gambler or his adversary will be ruined at the $n$th step or before, assuming that their initial capitals are $j$ and $a - j$. In example IX (sequential sampling) we have a similar situation.

In the general case the recurrent state $E_k$ will belong to a closed set $C$ containing more than one state. Once the system is in $C$, it will remain there and continue occasionally to pass through $E_k$. We seek the probability $x_j$ that the system, starting from the transient state $E_j$, will ultimately land in the closed set $C$.

Suppose that the system is initially in the transient state $E_j$ and let $x_j^{(n)}$ be the probability that at time $n$, and not sooner, the system reaches the closed set $C$. Then

(8.1)  $x_j = \sum_{n=1}^{\infty} x_j^{(n)}$

defines the probability that the system will ultimately reach and stay in $C$. By analogy with the simple random walk we shall call $x_j$ the probability of absorption in $C$. The difference $1 - x_j$ accounts for the possibility of absorption in other closed sets and (in the case of some infinite chains) of an indefinite continuation in transient states.

It is clear that

(8.2)  $x_j^{(1)} = \sum_{C} p_{jk}$,

the summation extending over those $k$ for which $E_k$ is contained in $C$. If the system reaches $C$ at the $(n + 1)$th step, then the first step must lead from $E_j$ to another transient state. It is therefore clear that

(8.3)  $x_j^{(n+1)} = \sum_{T} p_{j\nu} x_\nu^{(n)}$,

the summation now extending over those $\nu$ for which $E_\nu$ is transient. Equations (8.2) and (8.3) are recurrence relations which uniquely determine the $x_j^{(n)}$. Adding (8.3) for $n = 1, 2, 3, \ldots$, we find that the absorption probabilities $x_j$ are solutions of the system of linear equations

(8.4)  $x_j - \sum_{T} p_{j\nu} x_\nu = x_j^{(1)}$.


Examples. (a) Random Walk with Absorbing Barriers (example III). Take for $C$ the absorbing state $E_0$. Then $x_1^{(1)} = q$ and $x_j^{(1)} = 0$ if $j > 1$. The system (8.4) therefore reduces to

(8.5)
$x_1 - p x_2 = q$,
$x_j - q x_{j-1} - p x_{j+1} = 0 \qquad (j = 2, 3, \ldots, a - 2)$,
$x_{a-1} - q x_{a-2} = 0$.

This is the same as the system (2.1)-(2.2) of Chapter 14, and the solution is given in (2.4).

(b) Sequential Sampling (example IX). Again let $C$ be the state $E_0$. Then $x_j^{(1)} = r_j$, and the equations (8.4) reduce to (8.2) of Chapter 14 (where $u_x$ stands for the present $x_j$; cf. also problem 18 of Chapter 14).

(c) Genetics (example X). Here each of the two states $E_0$ and $E_{2N}$ forms a closed set. Absorption in $E_0$ and in $E_{2N}$ signifies, respectively, that the population ultimately consists only of aa- or only of AA-individuals. For the absorption in $E_0$ we have $x_j^{(1)} = p_{j0} = (1 - j/2N)^{2N}$, and hence (8.4) assumes the form

(8.6)  $x_j - \sum_{\nu=1}^{2N-1} \binom{2N}{\nu} \left(\frac{j}{2N}\right)^{\nu} \left(1 - \frac{j}{2N}\right)^{2N-\nu} x_\nu = \left(1 - \frac{j}{2N}\right)^{2N}$.

It is plausible that at a moment when the A- and a-genes are in the proportion $j : 2N - j$ their survival chances should be in the same ratio. This suggests that the solution to (8.6) should be $x_j = 1 - j/2N$. This is easily verified, using the fact that (8.6) contains the terms of the binomial distribution with mean $2N(j/2N) = j$.

(d) Example (5.b). Let $C$ consist of $E_1$ and $E_2$. Then $x_5^{(1)} = x_6^{(1)} = 1/3$ and equations (8.4) take on the form $x_5 - (x_5 + x_6)/6 = 1/3$ and $x_6 - (x_5 + x_6)/6 = 1/3$. The solution is $x_5 = x_6 = 1/2$, as should be expected for reasons of symmetry.
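
For a finite chain the system (8.4) is an ordinary set of linear equations and can be solved exactly. The Python sketch below is an illustration only: the small Gauss-Jordan routine `solve` and the function `absorption_probs` are hypothetical helpers of ours, and the matrix is the six-state chain of the last example, with indices 0 to 5 for its six states, indices 4 and 5 being the transient ones.

```python
from fractions import Fraction as F

def solve(a, b):
    """Solve a x = b exactly by Gauss-Jordan elimination.

    Raises StopIteration if the matrix is singular.
    """
    n = len(a)
    m = [[F(x) for x in row] + [F(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [x / m[col][col] for x in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                m[r] = [x - m[r][col] * y for x, y in zip(m[r], m[col])]
    return [row[n] for row in m]

def absorption_probs(p, transient, closed):
    """Absorption probabilities in the set closed, one per transient state."""
    # Left side: identity minus the transient-to-transient block of p.
    a = [[(1 if j == v else 0) - p[j][v] for v in transient]
         for j in transient]
    # Right side: one-step probabilities of entering the closed set.
    b = [sum(p[j][k] for k in closed) for j in transient]
    return solve(a, b)

h, t, s = F(1, 2), F(1, 3), F(1, 6)
P = [[h, h, 0, 0, 0, 0],
     [h, h, 0, 0, 0, 0],
     [0, 0, t, 2 * t, 0, 0],
     [0, 0, 2 * t, t, 0, 0],
     [s, s, s, s, s, s],
     [s, s, s, s, s, s]]

x_first = absorption_probs(P, transient=[4, 5], closed=[0, 1])
x_second = absorption_probs(P, transient=[4, 5], closed=[2, 3])
```

Since the chain is finite, the absorption probabilities for the two closed sets should add to unity for each transient starting state.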


Once the system is within the closed set $C$, the process continues as described in the preceding two sections. In particular, if $C$ is not periodic and the system is known to be in $C$, then the probability of finding it in the particular state $E_k$ of $C$ tends to $1/\mu_k$, where $\mu_k$ is the mean recurrence time of $E_k$. It follows easily that, if $E_j$ is transient,

(8.7)  $p_{jk}^{(n)} \to \frac{x_j}{\mu_k}$.

Similarly, if $C$ is periodic, then also $p_{jk}^{(n)}$ will be asymptotically periodic. We have thus completed the description of the asymptotic behavior of $p_{jk}^{(n)}$ for all cases.


Two questions have been left open: (1) Is the solution of the system of linear equations (8.4) unique? (2) What is the probability that the system will continue to pass from transient state to transient state without ever reaching a recurrent state? The two questions are closely related.

Let $E_j$ be transient and let $y_j^{(n)}$ be the probability that the system is at time $n$ in a transient state, given that it started from $E_j$ at time 0. Obviously

(8.8)  $y_j^{(1)} = \sum_{T} p_{j\nu}, \qquad y_j^{(n+1)} = \sum_{T} p_{j\nu} y_\nu^{(n)}$,

the summations again extending over all $\nu$ for which $E_\nu$ is transient. It follows from (8.8) that $y_j^{(1)} \le 1$ and hence $y_j^{(2)} \le y_j^{(1)}$, and generally $y_j^{(n+1)} \le y_j^{(n)}$. Therefore a limit

(8.9)  $y_j = \lim_{n \to \infty} y_j^{(n)}$

exists; $y_j$ is the probability of the system forever staying in transient states. From (8.8) we have

(8.10)  $y_j = \sum_{T} p_{j\nu} y_\nu$.

If $y_j$ is any solution of (8.10) with $|y_j| \le 1$, then a comparison of (8.10) and (8.8) shows first that $|y_j| \le y_j^{(1)}$, and then by induction that $|y_j| \le y_j^{(n)}$ for all $n$. Hence we have the

Theorem: the probability $y_j$ of the system forever remaining in the set of transient states satisfies the system of linear equations (8.10). This probability is zero for all $j$ if and only if the system (8.10) has no bounded solution except $y_j \equiv 0$. This is always the case for finite chains. For, if there are only finitely many $y_j$, let $M$ be their maximum. From (8.10) we have

(8.11)  $M \le \sum_{T} p_{j\nu} M$,

and the equality sign can hold only if ($j$ being fixed) the $p_{j\nu}$ for which $y_\nu = M$ add to unity. In this case, however, these $E_\nu$ would form a closed set, and since the chain is finite, not all of them could be transient.

If now (8.4) has two different bounded solutions, then their difference $\{y_j\}$ is a solution of (8.10). Conversely, if (8.10) has a solution $y_j$, then $x_j + y_j$ is a new solution of (8.4). Hence:

For the solution $x_j$ of (8.4) to be unique it is necessary and sufficient that the probability $y_j$ of the system forever remaining in transient states is zero for every initial transient $E_j$. This is always the case if the chain is finite.





Duration of the Game. For a given initial transient state $E_j$ let $Y_j$ be the time
when the system for the first time passes into a recurrent state (so that $Y_j - 1$ is
the number of steps preceding absorption in some closed set). In some of our
examples the random variable $Y_j$ is the duration of the game, and we use this
term generally. Clearly $\Pr\{Y_j = n\} = y_j^{(n-1)} - y_j^{(n)}$ (with $y_j^{(0)} = 1$); these
probabilities add to unity if, and only if, $y_j = 0$. In this case

(8.12)    $d_j = \sum_{n=1}^{\infty} n \Pr\{Y_j = n\} = \sum_{n=0}^{\infty} y_j^{(n)}$

is the mean duration of the game. From (8.8) it follows that the mean duration is
the solution of the system of linear equations

(8.13)    $d_j - \sum_{T} p_{j\nu}\, d_\nu = 1.$

The solution is uniquely determined whenever $y_j = 0$, that is, whenever there is
certainty that the game will end after finitely many steps. (If this is not the case,
there is no finite mean duration.) (Cf. problem 18 of Chapter 14.)
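
The system (8.13) can be solved mechanically for a small chain. The sketch below is only an illustration under stated assumptions: the chain is again a hypothetical symmetric random walk on 0, ..., 4 with absorbing barriers (transient states 1, 2, 3), and the helper names `solve_linear` and `mean_durations` are ours. Exact rational arithmetic is used so that the result can be compared with (8.13) directly.

```python
from fractions import Fraction


def solve_linear(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination over Fractions."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row with a non-zero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * p for x, p in zip(M[r], M[col])]
    return [row[n] for row in M]


def mean_durations(Q):
    """Solve (8.13): d_j - sum_v Q[j][v] d_v = 1 for every transient state j."""
    n = len(Q)
    A = [[(1 if j == v else 0) - Q[j][v] for v in range(n)] for j in range(n)]
    return solve_linear(A, [1] * n)


half = Fraction(1, 2)
# Hypothetical example: transient states 1, 2, 3 of a symmetric walk on 0..4.
Q = [[0, half, 0],
     [half, 0, half],
     [0, half, 0]]

d = mean_durations(Q)
print([str(value) for value in d])
```

For this example the mean durations from the states 1, 2, 3 come out as 3, 4, 3, which can be checked by substitution into (8.13).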

9. Application to Card Shuffling 

A deck of N cards numbered 1, 2, ..., N can be arranged in N!
different orders, and each represents a possible state of the system.
Every particular shuffling operation effects a transition from the
existing state into some other state. For example, “cutting” will
change the order (1, 2, ..., N) into one of the N cyclically equivalent
orders (r, r + 1, ..., N, 1, 2, ..., r − 1). The same operation applied
to the inverse order (N, N − 1, ..., 1) will produce (N − r + 1,
N − r, ..., 1, N, N − 1, ..., N − r + 2). In other words, we
conceive of each particular shuffling operation as a transformation
$E_j \to E_k$. If exactly the same operation is repeated, the system will
pass (starting from the given state $E_j$) through a well-defined succession
of states, and after a finite number of steps the original order will be
re-established. From then on the same succession of states will recur
periodically. For most operations the period will be rather small, and
in no case can all states be reached by this procedure.⁷ For example, a
perfect “lacing” would change a deck of 2m cards from (1, ..., 2m)
into (1, m + 1, 2, m + 2, ..., m, 2m). With 6 cards four applications of
this operation will re-establish the original order. With 10 cards the
initial order will reappear after six operations, so that repeated perfect
lacing of a deck of 10 cards can produce only six out of the 10!
= 3,628,800 possible orders.
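
The periods quoted for perfect lacing are easily checked by direct computation. The following Python sketch applies the lacing (1, ..., 2m) → (1, m + 1, 2, m + 2, ..., m, 2m) repeatedly until the original order reappears; the function names `perfect_lacing` and `lacing_period` are ours and serve only this illustration.

```python
def perfect_lacing(deck):
    """Interleave the two halves: (1, ..., 2m) -> (1, m+1, 2, m+2, ..., m, 2m)."""
    m = len(deck) // 2
    laced = []
    for top, bottom in zip(deck[:m], deck[m:]):
        laced.extend([top, bottom])
    return laced


def lacing_period(size):
    """Number of perfect lacings after which a deck of `size` cards is restored."""
    start = list(range(1, size + 1))
    deck, count = perfect_lacing(start), 1
    while deck != start:
        deck, count = perfect_lacing(deck), count + 1
    return count


for size in (6, 10):
    print(size, "cards: period", lacing_period(size))
```

The period is also the number of distinct orders that repeated lacing can produce, which is the point of the comparison with 10!.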

7 In the language of group theory this amounts to saying that the permutation
group is not cyclic and can therefore not be generated by a single operation.

In practice the player may vary the operations, and certainly the
play of chance will introduce variations, so that even a player attempting
to achieve identical operations will not always be successful. We
shall assume that we can account for the player’s habits and the influence
of chance variations by assuming that every particular operation
has a certain probability (possibly zero). We need assume nothing
about the numerical values of these probabilities, but shall suppose
that the player operates without regard to the past and does not know
the order of the cards.⁸ This implies that the successive operations
correspond to independent trials with fixed probabilities: for the actual
deck of cards we then have a Markov chain.

We now show that the matrix P of transition probabilities is doubly
stochastic [example (6.d)]. In fact, if an operation changes a state
(order of cards) $E_j$ to $E_k$, then there exists another state $E_r$ which it
will change into $E_j$. This means that the elements of the jth column of
P are identical with the elements of the jth row, except that they appear
in a different order. All column sums are therefore unity.

It follows [example (6.d)] that no state can be transient. If the 
chain is irreducible and aperiodic, then in the limit all states become 
equally probable. In other words, any kind of shuffling will do, provided 
only that it produces an irreducible and aperiodic chain. It is 
safe to assume that this usually is the case. Suppose, however, that 
the deck contains an even number of cards and the procedure consists 
in dividing them equally into two parts and shuffling them separately 
by any method. If the two parts are put together in their original 
order, then the Markov chain is reducible (since not every state 
can be reached from every other state). If the order is inverted, the 
chain will have period 2. Thus both contingencies can arise in theory,
but hardly in practice, when action of chance precludes perfect regularity.
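
The double stochasticity, and the approach to equal probabilities, can be observed on a deck small enough to enumerate. The sketch below is an illustration under assumptions of our own: a deck of three cards, three hypothetical operations (leave the deck alone, swap the top two cards, move the top card to the bottom) and arbitrarily chosen probabilities 1/2, 1/4, 1/4 for them.

```python
from fractions import Fraction
from itertools import permutations

states = list(permutations((1, 2, 3)))          # the 3! = 6 orders of the deck
index = {state: i for i, state in enumerate(states)}

# Hypothetical operations: op[i] is the old position of the card that ends
# up in position i. The probabilities are arbitrary but add to unity.
operations = {
    (0, 1, 2): Fraction(1, 2),   # leave the deck alone
    (1, 0, 2): Fraction(1, 4),   # swap the top two cards
    (1, 2, 0): Fraction(1, 4),   # cut: move the top card to the bottom
}


def transition_matrix():
    size = len(states)
    P = [[Fraction(0)] * size for _ in range(size)]
    for state in states:
        for op, prob in operations.items():
            target = tuple(state[i] for i in op)
            P[index[state]][index[target]] += prob
    return P


def multiply(A, B):
    size = len(A)
    return [[sum(A[i][v] * B[v][k] for v in range(size)) for k in range(size)]
            for i in range(size)]


P = transition_matrix()
column_sums = [sum(row[k] for row in P) for k in range(len(P))]

power = P
for _ in range(59):              # power becomes the 60th power of P
    power = multiply(power, P)
deviation = max(abs(entry - Fraction(1, 6)) for row in power for entry in row)

print("column sums:", [str(c) for c in column_sums])
print("largest deviation from 1/6 after 60 steps:", float(deviation))
```

With these particular operations the chain is irreducible and aperiodic, so every entry of the 60th power is already very close to 1/6; other choices of operations may of course produce a reducible or periodic chain, as described above.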

It is seen that continued shuffling may reasonably be expected to 
produce perfect “randomness” and to eliminate all traces of the original 
order. It should be noted, however, that the number of operations 
required for this purpose is extremely large. 9 

8 This assumption corresponds to the usual situation at bridge. It is easy to
devise more complicated shuffling techniques in which the operations depend on
previous operations and the final outcome is not a Markov chain [cf. example
(10.e)].

9 For an analysis of unbelievably poor results of shuffling in records of extrasensory
perception experiments cf. W. Feller, Statistical Aspects of ESP, Journal
of Parapsychology, vol. 4 (1940), pp. 271-298. In their amusing A Review of Dr.
Feller's Critique, ibid., pp. 299-319, J. A. Greenwood and C. E. Stuart try to show
that these results are due to chance. Both their arithmetic and their experiments
have a distinct tinge of the supernatural (cf. Chapter 2, problem 16 of section 8).




10. The General Markov Process 

In applications it is usually convenient to describe Markov chains
in terms of random variables. This can be done by the simple device of
replacing in the preceding sections the symbol $E_k$ by the integer k.
The state of the system at time n then is a random variable $X^{(n)}$, which
assumes the value k with probability $a_k^{(n)}$; the joint distribution of
$X^{(n)}$ and $X^{(n+1)}$ is given by $\Pr\{X^{(n)} = j, X^{(n+1)} = k\} = a_j^{(n)} p_{jk}$,
and the joint distribution of $(X^{(0)}, \ldots, X^{(n)})$ is given by (1.1). It is
also possible and sometimes preferable to assign to $E_k$ a numerical
value $e_k$ different from k. With this notation a Markov chain becomes
a special stochastic process,¹⁰ that is, a sequence of (dependent) random
variables $(X^{(0)}, X^{(1)}, \ldots)$, every finite collection of which has a
well-defined joint probability distribution.¹¹ The superscript n plays the
role of time. In Chapter 17 we shall get a glimpse of more general
stochastic processes in which the time parameter is permitted to vary
continuously. The term “Markov process” is applied to a very large
and important class of stochastic processes (with both discrete and
continuous time parameters). Even in the discrete case there exist
more general Markov processes than the simple chains which we have
studied so far. It will, therefore, be useful to give a definition of the
Markov property, to point out the special condition characterizing our
Markov chains, and, finally, to give a few examples of non-Markovian
processes.

Conceptually, a Markov process is the probabilistic analogue of the 
processes of classical mechanics, where the future development is 
completely determined by the present state and is independent of the 
way in which the present state has developed. The processes of mechanics
are in contrast to processes with aftereffect (or hereditary
processes), such as occur in the theory of plasticity, where the whole
past history of the system influences its future. In stochastic processes
the future is never uniquely determined, but we have at least probability
relations enabling us to make predictions. For the Markov
chains studied in this chapter it is clear that probability relations 
relating to the future depend on the present state, but not on the 
manner in which the present state has emerged from the past. In 
other words, if two independent systems subject to the same transition 
probabilities happen to be in the same state, then all probabilities 

10 The terms “stochastic process” and “random process” are synonyms and cover 
practically all the theory of probability from coin tossing to harmonic analysis. 
In practice, the term “stochastic process” is used mostly when a time parameter 
is introduced. 

11 It is clear that these joint distributions must be mutually consistent. 





relating to their future developments are identical. This is a rather 
vague description which is formalized in the following 

Definition. A sequence of discrete-valued random variables is a Markov
process if for every finite collection of integers $n_1 < n_2 < \cdots < n_r < n$
the joint distribution of $(X^{(n_1)}, X^{(n_2)}, \ldots, X^{(n_r)}, X^{(n)})$ is defined in such a way
that the conditional probability of the relation $X^{(n)} = x$ on the hypothesis
that $X^{(n_1)} = x_1, \ldots, X^{(n_r)} = x_r$ is identical with the conditional probability
of $X^{(n)} = x$ on the single hypothesis $X^{(n_r)} = x_r$. Here $x_1, \ldots, x_r, x$
are arbitrary numbers for which the hypothesis has a positive probability.

Reduced to simpler terms, this definition states that, given the
state $x_r$ at time $n_r$, no additional data concerning states of the system
at previous times can alter the (conditional) probability of the state
x at a future time n.

The Markov chains studied in this chapter are obviously Markov
processes, but they have the following additional property not implied
by the definition. For the Markov chains studied in the preceding
sections the transition probabilities $p_{jk} = \Pr\{X^{(m+1)} = k \mid X^{(m)} = j\}$ are
independent of m. The more general transition probabilities

(10.1)    $p_{jk}^{(n-m)} = \Pr\{X^{(n)} = k \mid X^{(m)} = j\} \qquad (m < n)$

then depend only on the difference n − m. One says in this case that
the transition probabilities are stationary (or constant). For a general
integral-valued Markov chain the right side in (10.1) depends on m
and n. We shall denote it by $p_{jk}(m, n)$ so that $p_{jk}(n, n + 1)$ is the
one-step transition probability at time n. Instead of (1.1) we get now for the
probability of the path $(j_0, j_1, \ldots, j_n)$ the expression

(10.2)    $a_{j_0}^{(0)}\, p_{j_0 j_1}(0, 1)\, p_{j_1 j_2}(1, 2) \cdots p_{j_{n-1} j_n}(n - 1, n).$

The proper generalization of (3.3) is obviously the identity

(10.3)    $p_{jk}(m, n) = \sum_{\nu} p_{j\nu}(m, r)\, p_{\nu k}(r, n),$

which is valid for all r with m < r < n. This identity follows directly
from the definition of a Markov process and also from (10.2); it is
called the Chapman-Kolmogorov equation.
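
The identity (10.3) can be verified numerically for a chain with non-stationary transition probabilities. In the sketch below the two-state chain and its one-step matrices `one_step(n)` are hypothetical, chosen only so that the transition probabilities really depend on the time; `transition(m, n)` builds $p_{jk}(m, n)$ by summing the products in (10.2) over the intermediate states, which is a matrix product.

```python
from fractions import Fraction


def one_step(n):
    """Hypothetical one-step matrix p_jk(n, n+1) on two states, depending on n."""
    a = Fraction(1, n + 2)
    return [[1 - a, a],
            [Fraction(1, 2), Fraction(1, 2)]]


def multiply(A, B):
    size = len(A)
    return [[sum(A[i][v] * B[v][k] for v in range(size)) for k in range(size)]
            for i in range(size)]


def transition(m, n):
    """p_jk(m, n) for m < n: product of the one-step matrices for times m..n-1."""
    result = one_step(m)
    for t in range(m + 1, n):
        result = multiply(result, one_step(t))
    return result


# Check (10.3) with m = 0, n = 3 and both admissible intermediate times r.
checks = [transition(0, 3) == multiply(transition(0, r), transition(r, 3))
          for r in (1, 2)]
print(checks)
```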

In the present chapter we have dealt mostly with the asymptotic
behavior of the higher transition probabilities, and few of the established
properties are common to the most general discrete Markov
process. We shall, therefore, not dwell on the general theory.

Examples of Non-Markovian Processes. (a) The Polya Urn Scheme
[Chapter 5, example (2.c)]. Let $X^{(n)}$ equal 1 or 0 according to
whether the nth drawing results in a black or red ball. The sequence
$\{X^{(n)}\}$ is not a Markov process. For example,

$\Pr\{X^{(3)} = 1 \mid X^{(2)} = 1\} = (b + c)/(b + r + c),$

but

$\Pr\{X^{(3)} = 1 \mid X^{(2)} = 1, X^{(1)} = 1\} = (b + 2c)/(b + r + 2c).$

(Cf. Chapter 5, problems 16-17.) On the other hand, if $Y^{(n)}$ is the
number of black balls in the urn at time n, then $\{Y^{(n)}\}$ is an ordinary
Markov chain with constant transition probabilities.
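
The two conditional probabilities can be confirmed by enumerating all sequences of three drawings. The sketch below assumes the usual form of the Polya scheme (an urn with b black and r red balls; after each drawing the ball is returned together with c additional balls of the same color) and, purely for illustration, the values b = 2, r = 3, c = 1; the function names are ours.

```python
from fractions import Fraction
from itertools import product


def path_probability(colors, b, r, c):
    """Probability of a given sequence of drawings (1 = black, 0 = red)."""
    prob = Fraction(1)
    black, red = b, r
    for color in colors:
        if color == 1:
            prob *= Fraction(black, black + red)
            black += c   # the black ball is returned with c more black balls
        else:
            prob *= Fraction(red, black + red)
            red += c     # the red ball is returned with c more red balls
    return prob


def conditional(event, given, b, r, c, draws=3):
    """Pr{event | given}; both are predicates on the tuple of drawings."""
    joint = total = Fraction(0)
    for colors in product((0, 1), repeat=draws):
        p = path_probability(colors, b, r, c)
        if given(colors):
            total += p
            if event(colors):
                joint += p
    return joint / total


b, r, c = 2, 3, 1   # illustrative values only
# The tuple x holds (X1, X2, X3) at the indices 0, 1, 2.
one_back = conditional(lambda x: x[2] == 1, lambda x: x[1] == 1, b, r, c)
two_back = conditional(lambda x: x[2] == 1,
                       lambda x: x[0] == 1 and x[1] == 1, b, r, c)
print(one_back, two_back)
```

The two values differ, so the additional knowledge of the first drawing changes the prediction for the third.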

(b) Higher Sums. Let $Y_0, Y_1, \ldots$ be mutually independent random
variables, and put $S_n = Y_0 + \cdots + Y_n$. The difference $S_n - S_m$
(with m < n) depends only on $Y_{m+1}, \ldots, Y_n$, and it is therefore easily
seen that the sequence $\{S_n\}$ is a Markov process. Now let us go one
step further, and define a new sequence of random variables $U_n$ by
$U_n = S_0 + S_1 + \cdots + S_n$ (which means that

$U_n = Y_n + 2Y_{n-1} + 3Y_{n-2} + \cdots + (n + 1)Y_0$).

The sequence $\{U_n\}$ forms a stochastic process whose probability relations
can, in principle, be expressed in terms of the distributions of the
$Y_k$. The $\{U_n\}$ process is in general not of the Markov type, since there
is no reason why, for example, $\Pr\{U_n = 0 \mid U_{n-1} = a\}$ should be the
same as $\Pr\{U_n = 0 \mid U_{n-1} = a, U_{n-2} = b\}$; the knowledge of $U_{n-1}$
and $U_{n-2}$ permits better predictions than the sole knowledge of $U_{n-1}$.

In the case of a continuous time parameter the preceding summations 
are replaced by integrations. In diffusion theory the Y n play the role 
of accelerations; the S n are then velocities, and the U n positions. If 
only positions can be measured, we are compelled to study a non- 
Markovian process even though it is indirectly defined in terms of a 
Markov process. 

(c) Moving Averages. Again let $\{Y_n\}$ be a sequence of mutually
independent random variables. Moving averages of order r are defined
by $X^{(n)} = (Y_n + Y_{n+1} + \cdots + Y_{n+r-1})/r$. It is easily seen that the
$X^{(n)}$ are not a Markov process. Processes of this type are common in
many applications. (Cf. problem 18.)

(d) A Traffic Problem. For an empirical example of a non-Markovian
process R. Fürth¹² made extensive observations on the number of
pedestrians on a certain segment of a street. An idealized mathematical
model of this process can be obtained in the following way. For

12 R. Fürth, Schwankungserscheinungen in der Physik, Sammlung Vieweg,
Braunschweig, 1920, pp. 17ff. The original observations appeared in Physikalische
Zeitschrift, vols. 19 (1918) and 20 (1919).





simplicity we assume that all pedestrians have the same speed v; also,
we consider only pedestrians moving in one direction. At time t = 0
we divide the positive x-axis into segments of fixed length $\delta$, each of
which may or may not contain a pedestrian. We suppose that the
distribution of pedestrians in our segments is determined by a sequence
of Bernoulli trials. In other words, we have a sequence of independent
random variables $Y_k$, each of which assumes the values 1 or 0 with
probabilities p and q, respectively. The segment $(k - 1)\delta < x < k\delta$
contains a pedestrian if $Y_k = 1$. Let now the whole axis move with
velocity v in the negative direction, and let us observe the number of
pedestrians in the fixed interval of length $N\delta$, which at time t = 0 is
covered by the interval $0 < x < N\delta$ of the moving x-axis. At time t
this fixed interval is covered by the interval $vt < x < vt + N\delta$ of the
x-axis. Let observations be made at times $n\delta/v$ and let $X^{(n)}$ be the
number of pedestrians in our fixed interval at the nth observation. Then
$X^{(n)} = Y_n + Y_{n+1} + \cdots + Y_{n+N-1}$, so that our process is, except
for the factor 1/N, a moving average process. It is therefore
non-Markovian. (Passing to the limit $\delta \to 0$, we obtain a continuous
model, in which a Poisson distribution takes over the role of the
binomial distribution.)

(e) Superposition of Markov Processes (Composite Shuffling). There
exist many technical devices (such as groups of selectors in telephone
exchanges, counters, filters) whose action can be described as a superposition
of two Markov processes with an output which is non-Markovian.
A fair idea of such mechanisms may be obtained from the study
of the following method of card shuffling.

In addition to the target deck of N cards we have a similar auxiliary 
deck, and the usual shuffling technique is applied to this auxiliary deck. 
If its cards appear in the order $(a_1, a_2, \ldots, a_N)$, then we permute the
cards of the target deck so that the first, second, ..., Nth cards are
transferred to the places number $a_1, a_2, \ldots, a_N$. Thus the shuffling of
the auxiliary deck indirectly determines the successive orderings of 
the target deck. The latter form a stochastic process which is not of the 
Markov type . To prove this, it suffices to show that the knowledge of 
two successive orderings of the target deck conveys in general more 
clues as to the future than the sole knowledge of the last ordering. We 
show this in a simple special case. 

Let N = 4, and suppose that the auxiliary deck is initially in the
order (2431). Suppose, furthermore, that the shuffling operation
always consists of a true “cutting,” that is, the ordering $(a_1, a_2, a_3, a_4)$
is changed into one of the three orderings $(a_2, a_3, a_4, a_1)$, $(a_3, a_4, a_1, a_2)$,
$(a_4, a_1, a_2, a_3)$; we attribute to each of these three possibilities probability
1/3. With these conventions the auxiliary deck will at any time
be in one of the four orderings (2431), (4312), (3124), (1243). On the
other hand, a little experimentation will show that the target deck will
gradually pass through all 24 possible orderings and that each of them
will appear in combination with each of the four possible orderings of
the auxiliary deck. This means that the ordering (1234) of the target
deck will recur infinitely often, and it will always be succeeded by one
of the four orderings (2431), (4312), (3124), (1243). Now the auxiliary
deck can never remain in the same ordering, and hence the target
deck cannot twice in succession undergo the same permutation.
Hence, if at times n − 1 and n the orderings are (1234) and (1243),
respectively, then at time n + 1 the state (1234) is impossible. Thus
the knowledge of the state at times n − 1 and n conveys more information
than the sole knowledge of the state at time n.

* 11. Miscellany 

1. Inverse Probabilities. Although it is most natural to investigate
the future development of a system, it is occasionally necessary to
study its past. Consider a Markov chain with states $E_k$ and constant
transition probabilities $p_{jk}$, whose absolute probabilities at time n are
$a_k^{(n)} = \sum_{\nu} a_\nu^{(0)} p_{\nu k}^{(n)}$. The conditional probability that the system was at
time m < n in state $E_j$, given that at time n it is in $E_k$, is

(11.1)    $q_{kj}(n, m) = \frac{a_j^{(m)}\, p_{jk}^{(n-m)}}{a_k^{(n)}}, \qquad m < n.$

This formula makes sense only if $a_k^{(n)} > 0$; otherwise the conditional
probability in question is not defined. If all $a_k^{(n)}$ are positive, then
(11.1) defines a system of transition probabilities with all the properties
required for a Markov process. In particular, the $q_{kj}(n, m)$ satisfy the
Chapman-Kolmogorov identity (10.3) with the time direction reversed,
namely,

(11.2)    $q_{kj}(n, m) = \sum_{\nu} q_{k\nu}(n, r)\, q_{\nu j}(r, m)$

(m < r < n). The $q_{kj}(n, m)$ are called inverse probabilities.¹³ Consider,
in particular, an irreducible chain with stationary probabilities

* Starred sections may be omitted at first reading since they treat special topics.

13 A. Kolmogoroff, Zur Theorie der Markoffschen Ketten, Mathematische Annalen,
vol. 112 (1935), pp. 155-160.





$\{u_k\}$. Then $a_k^{(n)} = u_k$ for all n, and $u_k > 0$ (cf. sections 6 and 7).
In this case the one-step transitions $q_{kj}(n + 1, n)$ are independent of n
and reduce to

(11.3)    $q_{kj} = \frac{u_j}{u_k}\, p_{jk}.$

The matrix $\{q_{kj}\}$ is stochastic, so that here the inverse probabilities
define a Markov chain with constant transition probabilities. If
$q_{kj} = p_{kj}$ the original chain is called reversible; its probability relations
are then symmetric in time.
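
Formula (11.3) is easily tried out on a small chain. In the sketch below the three-state matrix P is a hypothetical example (a particle that tends to move cyclically in one direction); its stationary probabilities are all 1/3, which the code checks rather than assumes. The name `reversed_chain` is ours.

```python
from fractions import Fraction


def reversed_chain(P, u):
    """Inverse probabilities q_kj = u_j p_jk / u_k, as in (11.3)."""
    size = len(P)
    return [[u[j] * P[j][k] / u[k] for j in range(size)] for k in range(size)]


# Hypothetical irreducible chain: mostly cyclic motion 0 -> 1 -> 2 -> 0.
P = [[Fraction(0), Fraction(3, 4), Fraction(1, 4)],
     [Fraction(1, 4), Fraction(0), Fraction(3, 4)],
     [Fraction(3, 4), Fraction(1, 4), Fraction(0)]]
u = [Fraction(1, 3)] * 3

# Verify that u really is stationary: sum_j u_j p_jk = u_k for every k.
is_stationary = all(sum(u[j] * P[j][k] for j in range(3)) == u[k]
                    for k in range(3))

Q = reversed_chain(P, u)
row_sums = [sum(row) for row in Q]
reversible = (Q == P)

print("stationary:", is_stationary)
print("row sums of Q:", [str(s) for s in row_sums])
print("reversible:", reversible)
```

Here the matrix of inverse probabilities is stochastic but differs from P (the reversed chain circulates in the opposite direction), so this particular chain is not reversible.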

2. The Central Limit Theorem. The theory of recurrent events
contains further information concerning Markov chains. Let $E_k$ be a
fixed recurrent state whose recurrence time has finite variance $\sigma_k^2$
(this condition is always satisfied if the chain is finite). Let $N_n$ denote
the number of passages up to time n of the system through $E_k$. Then
we know from Chapter 12, section 4, that the variable $N_n$ is asymptotically
normally distributed. In the notations of the present chapter
we have $E(N_n) \sim n/\mu_k = n u_k$ ($\mu_k$ being the mean recurrence time of
$E_k$); a way to calculate the variance in the
case of finite chains will be indicated in the next chapter. It follows
in particular that for every $\epsilon > 0$ as $n \to \infty$ the probability tends to one
that $|N_n/n - u_k| < \epsilon$. This is the weak law of large numbers for the
number of passages through $E_k$. Similarly, the strong law of large
numbers and the law of the iterated logarithm hold and require no
special proof. In the case of an infinite chain, the recurrence time of
$E_k$ need not have a finite variance even if its mean is finite. However,
the general limit theorems for recurrent events apply in this case.

The random variable $N_n$ may be defined as follows. Define a
sequence of random variables $X_n$ so that $X_n$ equals 1 or 0 according to
whether the system is or is not at time n in state $E_k$. Then
$N_n = X_1 + \cdots + X_n$. This suggests the following generalization.
We assign to the state $E_k$ an arbitrary number $x_k$ and let the random
variable $X_n$ equal $x_k$ if at time n the system is in state $E_k$. As usual,
we put $S_n = X_1 + \cdots + X_n$. For finite Markov chains Doeblin¹⁴
has shown that in general the central limit theorem and the law of the
iterated logarithm hold for $S_n$. An exception occurs only if the
numbers $x_k$ are chosen so that for every shortest path leading from $E_k$
back to $E_k$ the sum of the $x_\nu$ equals a constant c independent of the
path.

14 W. Doeblin, Sur les propriétés asymptotiques de mouvements régis par certains
types de chaînes simples, Thesis, Paris, 1937.






3. Non-stochastic Matrices. The theorems of this chapter describe
the asymptotic behavior of the powers $P^n$ of an arbitrary stochastic
matrix P, that is, of a matrix whose elements satisfy the conditions
(1.2). It is easy to generalize these theorems to a more general class
of matrices. Let P be an arbitrary (finite or infinite) matrix with
non-negative elements and denote its row sums by $s_j$ so that $s_j = \sum_k p_{jk}$. We
assume that the sequence $s_j$ is bounded, that is, that there exists a constant
M such that $s_j \le M$. Under these conditions the asymptotic
behavior of $P^n$ is still described by our theorems, inasmuch as P can
be reduced to a stochastic matrix.

To fix ideas suppose that the rows and columns of P are numbered
starting with 1, and consider first the case where $s_j \le 1$ for all j. In
this case we enlarge (border) the matrix P by adding a row and a
column number zero whose elements are defined by $p_{00} = 1$, $p_{01} = p_{02}
= \cdots = 0$, and $p_{j0} = 1 - s_j$ for $j \ge 1$. The new matrix Q is stochastic,
and its asymptotic behavior is given by our theorems. On the other
hand, $P^n$ is the submatrix of $Q^n$ belonging to the corner element $p_{00}^{(n)}$,
that is, the matrix obtained from $Q^n$ by deleting row and column number
zero. In the general case the row sums $s_j$ may exceed unity, but we may replace
the matrix P by the matrix $P^*$ whose elements are $p_{jk}/M$. The row
sums $s_j^*$ of $P^*$ satisfy the condition $s_j^* \le 1$, and we are able to describe
the asymptotic behavior of the powers $P^{*n}$. However, the
matrices $P^n$ and $P^{*n}$ differ only by the factor $M^n$, so that our theorems
actually describe the asymptotic behavior of $p_{jk}^{(n)}$ in all cases.
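
The bordering device can be illustrated on a small matrix. In the sketch below P is a hypothetical 2 × 2 matrix with row sums 3/4 and 2/3; the function `border` (our name) adds row and column number zero as described, and the code compares the powers of Q and P.

```python
from fractions import Fraction


def border(P):
    """Add row and column number zero: p_00 = 1, p_0k = 0, p_j0 = 1 - s_j."""
    size = len(P)
    Q = [[Fraction(1)] + [Fraction(0)] * size]
    for row in P:
        Q.append([1 - sum(row)] + list(row))
    return Q


def multiply(A, B):
    size = len(A)
    return [[sum(A[i][v] * B[v][k] for v in range(size)) for k in range(size)]
            for i in range(size)]


def power(A, n):
    result = A
    for _ in range(n - 1):
        result = multiply(result, A)
    return result


# Hypothetical matrix with non-negative elements and row sums below unity.
P = [[Fraction(1, 2), Fraction(1, 4)],
     [Fraction(1, 3), Fraction(1, 3)]]
Q = border(P)

n = 5
Qn, Pn = power(Q, n), power(P, n)
inner = [row[1:] for row in Qn[1:]]   # delete row and column number zero

print("Q is stochastic:", all(sum(row) == 1 for row in Q))
print("P^n sits inside Q^n:", inner == Pn)
```

Since state zero of the bordered chain is absorbing, the first column of $Q^n$ collects the mass that has been lost from P up to time n.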

Matrices of the described type occur in the theory of generalized 
random walks with creation or destruction of masses. 

4. Literature. There exists a huge literature on finite Markov chains.
A detailed account of the various methods of attack and references to
earlier work will be found in the comprehensive treatise by M. Fréchet.¹⁵
An algebraic treatment of finite chains will be described in the next
chapter. The entire theory of finite chains can be derived from
Frobenius' theory of matrices with positive elements. This method
has been exploited in particular by V. Romanovsky. Unfortunately
these methods do not carry over to the more interesting case of infinite
chains, first considered by A. Kolmogorov.¹⁶ His work was continued
by W. Doeblin¹⁷ and J. L. Doob.¹⁸ The latter derived the ergodic
properties from general group theory. The method used in this
chapter, based on the general theory of recurrent events, is new.¹⁹ It
permits a uniform treatment of finite and infinite chains and represents
a simplification even in the case of finite chains. The states which
we call transient are usually called unessential and the interesting problem
of absorption probabilities is neglected. This is explained by a
predominantly abstract attitude. In practical cases the interest often
centers on transient states.

15 Recherches théoriques modernes sur le calcul des probabilités, vol. 2 (théorie des
événements en chaîne dans le cas d'un nombre fini d'états possibles), Paris, 1938.
Another monograph on Markov chains is due to B. Hostinsky, Méthodes générales
du calcul des probabilités, fasc. 52 of the Mémorial des sciences mathématiques, Paris,
1931.

16 Anfangsgründe der Theorie der Markoffschen Ketten mit unendlich vielen möglichen
Zuständen, Matematiceskii Sbornik, N.S., vol. 1 (1936), pp. 607-610. This
paper contains no proofs. A complete exposition was given only in Russian, in
Bulletin de l'Université d'État à Moscou, Sect. A, vol. 1 (1937), pp. 1-15.


12. Problems for Solution 

1. Classify the states for the three chains whose matrices P have the rows given
below. Find in each case $P^2$ and the asymptotic behavior of $p_{jk}^{(n)}$.

(a) (0, 1/2, 1/2), (1/2, 0, 1/2), (1/2, 1/2, 0);

(b) (0, 0, 0, 1), (0, 0, 0, 1), (1/2, 1/2, 0, 0), (0, 0, 1, 0);

(c) (1/2, 0, 1/2, 0, 0), (1/4, 1/2, 1/4, 0, 0), (1/2, 0, 1/2, 0, 0),
(0, 0, 0, 1/2, 1/2), (0, 0, 0, 1/2, 1/2).


2. We consider throws of a true die and agree to say that at time n the system
is in state $E_j$ if j is the highest number appearing in the first n throws. Find the
matrix $P^n$ and verify that (3.3) holds.

3. In example XI find the (absorption) probabilities Xk and yk that, starting 

from Ek , the system will end in E\ or respectively (k = 2, 3, 4, 6). 

4. N black and N white balls are placed in two urns so that each urn contains
N balls. The number of black balls in the first urn is the state of the system. At
each step one ball is selected at random from each urn, and the two balls thus
selected are interchanged. Find the $p_{jk}$. Show that in the limiting distribution
the term $u_k$ equals the probability of getting exactly k black balls if N balls are
selected at random out of a collection of N black and N white balls.²⁰

17 Sur deux problèmes de M. Kolmogoroff concernant les chaînes dénombrables,
Bulletin Société Mathématique de France, vol. 66 (1939), pp. 1-11.

18 Topics in the Theory of Markoff Chains, and also Markoff Chains, Denumerable
Case, Transactions of the American Mathematical Society, vol. 52 (1942), pp.
37-64, and vol. 58 (1945), pp. 455-473.

19 A short outline is given in W. Feller, Fluctuation Theory of Recurrent Events,
Transactions of the American Mathematical Society, vol. 67 (1949), pp. 98-119.

20 This problem goes back to Laplace; cf. Fréchet's book (cited in footnote 15),
p. 49.

5. A chain with states $E_0, E_1, \ldots$ has transition probabilities

$p_{jk} = e^{-\lambda} \sum_{\nu=0}^{j} \binom{j}{\nu} p^{\nu} q^{j-\nu} \frac{\lambda^{k-\nu}}{(k-\nu)!},$

where the terms in the sum should be replaced by zero if $\nu > k$. Show that

$p_{jk}^{(n)} \to e^{-\lambda/q} \frac{(\lambda/q)^k}{k!}.$

Note: This chain occurs in statistical mechanics²¹ and can be interpreted as
follows. The state of the system is defined by the number of particles in a certain
volume of space. During each time interval of unit length each particle has probability
q to leave the volume, and the particles are statistically independent.
Moreover, new particles may enter the volume, and the probability of r entrants
is given by the Poisson expression $e^{-\lambda} \lambda^r / r!$. The stationary distribution is then a
Poisson distribution with parameter $\lambda/q$.

6. Ehrenfest model. In example VII let there initially be j molecules in the
first container, and let $X^{(n)} = 2k - a$ if at time n the system is in state k [so that
$X^{(n)}$ is the difference of the number of molecules in the two containers]. Let
$e_n = E(X^{(n)})$. Prove that $e_{n+1} = (a - 2)e_n/a$, whence $e_n = (1 - 2/a)^n (2j - a)$. (Note
that $e_n \to 0$ as $n \to \infty$.)

7. If the number of states is $a < \infty$ and if $E_k$ can be reached from $E_j$, then it
can be reached in a steps or less.

8. Let the chain contain a states and let $E_j$ be recurrent. There exists a number
q < 1 such that for n > a the probability of the recurrence time of $E_j$ exceeding
n is smaller than $q^n$. (Hint: Use problem 7.)

9. In an infinite chain with doubly stochastic matrix every state is either transient
or a recurrent null state [cf. example (6.d)].

10. Random walk with reflecting barriers. Consider a symmetric random walk in
a bounded region of the plane. The boundary is reflecting in the sense that, whenever
in an unrestricted random walk the particle would leave the region, it is now
forced to return to the last position. Show that, if every point of the region can
be reached from every other point, there exists a stationary distribution and
that $u_k = 1/a$, where a is the number of positions in the region.

11. In a finite chain Ej is transient if and only if there exists an Ek such that Ek 
can be reached from Ej but not Ej from Ek. (For infinite chains this is false, as 
shown by the following problem.) 

12. Suppose that in an infinite chain only the transitions $E_j \to E_{j+1}$ and
$E_j \to E_0$ are possible, and that their probabilities are $1 - p_j$ and $p_j$. Show that all
states are transient or recurrent according to whether $\sum p_j$ converges or diverges.

13. An irreducible chain for which one diagonal element pjj is positive cannot 
be periodic. 

14. A finite irreducible chain is non-periodic if and only if there exists an n 
such that pjk (n) > 0 for all j and k. 

15. In a chain with a states let $(x_1, \ldots, x_a)$ be a solution of the system of linear
equations $x_j = \sum_\nu p_{j\nu} x_\nu$. Prove: (a) the states $E_r$ for which $x_r > 0$ form a closed
(not necessarily irreducible) set; (b) if $E_j$ and $E_k$ belong to the same irreducible
set, then $x_j = x_k$.

16. Continuation. If $(x_1, \ldots, x_a)$ is a solution of $x_j = s \sum_\nu p_{j\nu} x_\nu$ with $|s| = 1$ but
$s \ne 1$, then there exists an integer t > 1 such that $s^t = 1$. If the chain is irreducible,
then the smallest integer of this kind is the period of the chain.

21 S. Chandrasekhar, Stochastic Problems in Physics and Astronomy, Reviews of
Modern Physics, vol. 15 (1943), pp. 1-89, in particular p. 45.




17. Mean ergodic theorem.²² In an arbitrary chain let

$A_{jk}^{(n)} = \frac{1}{n} \sum_{\nu=1}^{n} p_{jk}^{(\nu)}.$

If $E_j$ and $E_k$ belong to the same irreducible closed set, then $A_{jk}^{(n)}$ tends to a limit
which is independent of j and equals the stationary probability $u_k$, whenever the
latter exists. If $E_j$ and $E_k$ belong to different closed sets, then $A_{jk}^{(n)} = 0$ for all n.

If $E_k$ is transient, then $A_{jk}^{(n)} \to 0$ for all j.

18. Moving averages. Let $\{Y_k\}$ be a sequence of mutually independent
random variables, each assuming the values ±1 with probability 1/2. Put
$X^{(n)} = (Y_n + Y_{n+1})/2$. Find the transition probabilities

$p_{jk}(m, n) = \Pr\{X^{(n)} = k \mid X^{(m)} = j\},$

where m < n and j, k = −1, 0, 1. Conclude that $\{X^{(n)}\}$ is not a Markov process
and that (10.3) does not hold.

22 This theorem is a simple consequence of the results of the present chapter.
However, it is much weaker and can therefore be proved by simpler methods; cf.
K. Yosida and S. Kakutani, Markoff Processes with an Enumerable Infinite Number
of Possible States, Japanese Journal of Mathematics, vol. 16 (1939), pp. 47-55.



* CHAPTER 16 


ALGEBRAIC TREATMENT OF FINITE MARKOV CHAINS 

In this chapter we consider a Markov chain with finitely many 
states E\ y • • •, E a and a given matrix of transition probabilities pjk . 
Our main aim is to derive explicit formulas for the n-step transition 
probabilities pjk (w) . We shall not require the results of the preceding 
chapter, except the general concepts and notations of section 3. 

We shall make use of the method of generating functions and shall 
obtain the desired results from the partial fraction expansions of 
Chapter 11, section 7. Our results can also be obtained directly from 
the theory of canonical decompositions of matrices 1 (which in turn 
can be derived from our results). Also, for finite chains the ergodic 
properties proved in Chapter 15 follow from the results of the present 
chapter. However, for simplicity, we shall slightly restrict the
generality and disregard exceptional cases which complicate the general
theory and do not occur in practical examples.

The general method is outlined in section 1 and illustrated in sections 
2 and 3. In section 4 special attention is paid to transient states and 
absorption probabilities. In section 5 the theory is applied to finding 
the variances of the recurrence times of the states $E_j$.

1. General Theory 

For every fixed $j$, $k$ we define a generating function

(1.1)    $$P_{jk}(s) = \sum_{n=1}^{\infty} p_{jk}^{(n)} s^{n-1}.$$

Multiplying this equation by $s\,p_{\nu j}$ and adding over all $j$, we get

(1.2)    $$s \sum_{j=1}^{a} p_{\nu j} P_{jk}(s) = P_{\nu k}(s) - p_{\nu k}.$$

For every fixed $k$ we have here a system of $a$ non-homogeneous linear
equations for the $a$ unknowns $P_{1k}(s), \ldots, P_{ak}(s)$. Theoretically, this
system can be solved by means of determinants or by successive
eliminations of unknowns. We use only the fact that the determinant
$D(s)$ of the system is a polynomial of degree not exceeding $a$, and that
the $P_{\nu k}(s)$ are rational functions of $s$ with the common denominator
$D(s)$. We shall consider only the case where the equation $D(s) = 0$
has no multiple roots; this is a slight restriction of generality, but the
theory will cover most cases of practical interest.

* Starred chapters treat special topics and may be omitted at first reading.

1 Cf. the treatise by Fréchet cited in Chapter 15, section 11.

Since the $P_{\nu k}(s)$ are rational functions, we can use the partial fraction
expansion of Chapter 11, section 7. It follows that there exist coefficients
$p_{\nu k}^{(1)}, \ldots, p_{\nu k}^{(a)}$ such that

(1.3)    $$p_{\nu k}^{(n)} = \frac{p_{\nu k}^{(1)}}{s_1^{\,n}} + \frac{p_{\nu k}^{(2)}}{s_2^{\,n}} + \cdots + \frac{p_{\nu k}^{(a)}}{s_a^{\,n}},$$

where $s_1, s_2, \ldots$ are the roots of $D(s) = 0$. (On the right side the upper
index $(r)$ labels the root $s_r$ and is not a number of steps.) If the degree of $D(s)$ is
smaller than $a$, then (1.3) will contain fewer than $a$ terms. It is also
possible that for some particular values of $\nu$ and $k$ one or more roots $s_r$
are common to the numerator and denominator and hence cancel.
We can take care of such cases by letting the corresponding coefficients
$p_{\nu k}^{(r)}$ be zero.

We could calculate the roots $s_r$ and the coefficients $p_{\nu k}^{(r)}$ by the
methods of Chapter 11, but it is preferable to take advantage of
certain particular properties of Markov chains. Multiply (1.3) by $p_{j\nu}$
and sum over $\nu = 1, 2, \ldots, a$. The result is

(1.4)    $$p_{jk}^{(n+1)} = \sum_{\nu=1}^{a} p_{j\nu}\left\{\frac{p_{\nu k}^{(1)}}{s_1^{\,n}} + \cdots + \frac{p_{\nu k}^{(a)}}{s_a^{\,n}}\right\}.$$

If the left side is expressed by means of (1.3), one gets an identity
which can hold for all $n$ only if the coefficients of $s_1^{-n}, \ldots, s_a^{-n}$ on both
sides are equal. This means that for every fixed $r$ (with $1 \le r \le a$)
we must have

(1.5a)    $$p_{jk}^{(r)} = s_r \sum_{\nu=1}^{a} p_{j\nu}\, p_{\nu k}^{(r)}.$$

In like manner we get, on multiplying (1.3) by $p_{km}$ and adding over all $k$,

(1.5b)    $$p_{\nu m}^{(r)} = s_r \sum_{k=1}^{a} p_{\nu k}^{(r)}\, p_{km}.$$

The relations (1.5a) show that for $k$ and $r$ fixed the $a$ quantities
$p_{1k}^{(r)}, \ldots, p_{ak}^{(r)}$ are a solution of the system of $a$ linear equations

(1.6a)    $$x_j^{(r)} = s_r \sum_{\nu=1}^{a} p_{j\nu}\, x_\nu^{(r)} \qquad (j = 1, \ldots, a).$$

Similarly, (1.5b) implies that for $\nu$ and $r$ fixed, the $p_{\nu 1}^{(r)}, \ldots, p_{\nu a}^{(r)}$
satisfy the $a$ linear equations

(1.6b)    $$y_m^{(r)} = s_r \sum_{k=1}^{a} y_k^{(r)}\, p_{km} \qquad (m = 1, \ldots, a).$$

For a better understanding let us replace $s_r$ by an arbitrary $s$ and
study the two more general systems

(1.7a)    $$x_j = s \sum_{\nu=1}^{a} p_{j\nu}\, x_\nu \qquad (j = 1, \ldots, a)$$

and

(1.7b)    $$y_m = s \sum_{k=1}^{a} y_k\, p_{km} \qquad (m = 1, \ldots, a).$$

A system of $a$ homogeneous equations in $a$ unknowns can have a
non-trivial 2 solution only if its determinant vanishes. Now the
matrices of the two systems (1.7a) and (1.7b) are the same except that
rows and columns are interchanged. Their determinants are therefore
equal. Moreover, the determinant of (1.7a) obviously equals the
determinant of the system (1.2), which means that the determinants
of the two systems (1.7a) and (1.7b) vanish for $s = s_1, s_2, \ldots, s_a$.
We can now forget about the generating functions $P_{jk}(s)$ and define
the roots $s_r$ as those numbers (real or complex) for which the systems
(1.7a) and (1.7b) admit of non-trivial solutions. The assumption that
$s_r$ is a simple root means that for every fixed $r$ the solutions
$(x_1^{(r)}, \ldots, x_a^{(r)})$ and $(y_1^{(r)}, \ldots, y_a^{(r)})$ are uniquely determined except,
of course, for a numerical factor. However, our starting point was the
discovery that, for $k$ and $r$ fixed, $(p_{1k}^{(r)}, \ldots, p_{ak}^{(r)})$ is a solution of
(1.7a), while for $\nu$ and $r$ fixed $(p_{\nu 1}^{(r)}, \ldots, p_{\nu a}^{(r)})$ is a solution of (1.7b).
Since these solutions are determined up to a numerical factor, we must
have

(1.8)    $$p_{jk}^{(r)} = c_r\, x_j^{(r)}\, y_k^{(r)}.$$

There remains only the calculation of the constants $c_1, \ldots, c_a$.

2 As usual we call an identically vanishing solution trivial and disregard it.

From (1.8) and (1.3) we have

(1.9)    $$p_{jk}^{(n)} = \sum_{r=1}^{a} c_r\, x_j^{(r)}\, y_k^{(r)}\, s_r^{-n}.$$

Now

(1.10)    $$p_{jk}^{(2n)} = \sum_{\nu=1}^{a} p_{j\nu}^{(n)}\, p_{\nu k}^{(n)},$$

and we can express all quantities in (1.10) by means of (1.9). Equating
the coefficients of $s_r^{-2n}$ on both sides, we find

(1.11)    $$1 = c_r \sum_{\nu=1}^{a} x_\nu^{(r)}\, y_\nu^{(r)},$$

and thus we have found $c_r$. It is true that the solutions $x_j^{(r)}$ and $y_k^{(r)}$
are determined only up to a numerical factor. However, if we replace
the $x_j^{(r)}$ by $A x_j^{(r)}$, and the $y_k^{(r)}$ by $B y_k^{(r)}$, then $c_r$ will be changed into
$c_r/AB$ and the quantity $p_{jk}^{(r)}$ of (1.8) remains unchanged.

Summarizing, we have the following procedure to calculate $p_{jk}^{(n)}$.

Write down the two systems of linear equations (1.7a) and (1.7b).
They have a common determinant and admit of non-trivial solutions only
for values of $s$ for which this determinant vanishes. We suppose that
the roots $s_1, s_2, \ldots$ (of which there are at most $a$) are simple; then for
each $r$, the solutions $(x_1^{(r)}, \ldots, x_a^{(r)})$ and $(y_1^{(r)}, \ldots, y_a^{(r)})$ are determined
up to an arbitrary multiplicative constant. Find these solutions and then
the constants $c_r$ from (1.11). Form the quantities $p_{jk}^{(r)}$ in accordance
with (1.8); then $p_{jk}^{(n)}$ is given by (1.3).
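
The procedure just summarized can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the roots and the solutions of (1.7a) and (1.7b) are supplied by hand, the function names `spectral_power` and `mat_pow` are invented here, and the numbers belong to a hypothetical two-state chain with rows $(1-p, p)$ and $(\alpha, 1-\alpha)$. Exact fractions are used so that the result can be compared with a direct matrix product.

```python
from fractions import Fraction as F

def spectral_power(roots, xs, ys, n):
    """Build (p_jk^(n)) from the roots s_r and the solutions of (1.7a), (1.7b)."""
    a = len(xs[0])
    result = [[F(0)] * a for _ in range(a)]
    for s, x, y in zip(roots, xs, ys):
        c = 1 / sum(xv * yv for xv, yv in zip(x, y))      # constant c_r from (1.11)
        for j in range(a):
            for k in range(a):
                result[j][k] += c * x[j] * y[k] / s ** n   # (1.8) inserted into (1.3)
    return result

def mat_pow(P, n):
    """Plain n-fold matrix product, used only as an independent check (n >= 1)."""
    result = P
    for _ in range(n - 1):
        result = [[sum(result[i][m] * P[m][k] for m in range(len(P)))
                   for k in range(len(P))] for i in range(len(P))]
    return result

# Hypothetical two-state chain (values chosen for illustration only).
p, alpha = F(1, 3), F(1, 4)
P = [[1 - p, p], [alpha, 1 - alpha]]
roots = [F(1), 1 / (1 - alpha - p)]
xs = [(F(1), F(1)), (p, -alpha)]      # solutions of (1.7a), one per root
ys = [(alpha, p), (F(1), F(-1))]      # solutions of (1.7b), one per root
```

Calling `spectral_power(roots, xs, ys, n)` and `mat_pow(P, n)` for the same `n` should give identical matrices; rescaling any `xs[r]` or `ys[r]` leaves the result unchanged, as remarked after (1.11).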

For every fixed $r$ the $p_{jk}^{(r)}$ form a matrix which may be constructed
in the following way. Form a multiplication table with the $x_j^{(r)}$
heading the rows and the $y_k^{(r)}$ heading the columns. Multiplying all
$a^2$ elements of this square table by $c_r$, we get the matrix $(p_{jk}^{(r)})$. To
construct the matrix $(p_{jk}^{(n)})$ we have to divide all elements of $(p_{jk}^{(r)})$
by $s_r^{\,n}$ and add the matrices thus obtained for $r = 1, 2, \ldots$. Note that
the roots $s_r$ may be simple even if there are fewer than $a$ roots; however,
if there are $a$ distinct roots, $s_1, s_2, \ldots, s_a$, they are certainly simple.

The case of multiple roots requires certain changes, but may be
treated by similar methods. The case of greatest interest will be
discussed in section 4.

In algebra the reciprocals $\lambda_r = 1/s_r$ are called characteristic values (or eigen values
or latent roots) of the matrix $P$. Zero is a possible characteristic value, but to it
there corresponds no root $s_r$. This explains why there may be fewer than $a$ roots $s_r$
even though there are always $a$ characteristic values. The use of the $s_r$ rather than of
their reciprocals is more convenient for the method of generating functions.
Moreover, it corresponds to the general usage in the theory of integral equations and is
therefore more natural in probability theory.

The value $s = 1$ always occurs among the $s_r$, and to it there corresponds the solution
$(1, 1, \ldots, 1)$ of (1.7a). For all $r$ we have $|s_r| \ge 1$. In fact, 3 a root $s_r$ with $|s_r| < 1$
would lead to a divergent development in (1.3). If $s_1 = 1$ is the only root with
$|s_r| = 1$, then $p_{jk}^{(n)} \to c_1 x_j^{(1)} y_k^{(1)}$. It is not difficult to show that if there exist
other roots with $|s_r| = 1$, then they are necessarily $t$th roots of unity, where $t$ is
an integer; in this case the chain has period $t$. For details the reader is referred to
Fréchet's treatise quoted in Chapter 15, section 11.

Often it is cumbersome or impossible to find all roots $s_r$. However, it is clear
that the asymptotic behavior of $p_{jk}^{(n)}$ is determined in first approximation by the
$s_r$ with $|s_r| = 1$, and in second approximation by the roots $s_r$ with the next smallest
absolute value.

3 A direct proof is as follows. Let $M$ be the largest term in the sequence
$|x_1^{(r)}|, \ldots, |x_a^{(r)}|$ ($r$ fixed). Then from (1.6a) $M \le |s_r| \sum_\nu p_{j\nu} M = |s_r| M$, or
$|s_r| \ge 1$.


2. Examples 

(a) Consider first a chain with only two states. The matrix of
transition probabilities assumes the simple form

$$P = \begin{pmatrix} 1-p & p \\ \alpha & 1-\alpha \end{pmatrix}$$

where $0 < p < 1$ and $0 < \alpha < 1$. The equations (1.7a) reduce to
$s(1-p)x_1 + spx_2 = x_1$ and $s\alpha x_1 + s(1-\alpha)x_2 = x_2$. Equating the
two ratios $x_1/x_2$, it is found that a solution exists only if either $s = 1$
or $s = 1/(1-\alpha-p)$. The solution corresponding to $s_1 = 1$ is $(1, 1)$;
the solution corresponding to $s_2 = 1/(1-\alpha-p)$ is $(p, -\alpha)$. Next
take the system (1.7b) which now reduces to $s(1-p)y_1 + s\alpha y_2 = y_1$
and $spy_1 + s(1-\alpha)y_2 = y_2$. We know that it can be solved only
when $s = s_1$ or $s = s_2$. The corresponding solutions are $(\alpha, p)$ and
$(1, -1)$. From (1.11) we get $c_1 = c_2 = 1/(\alpha+p)$. Equations (1.8)
and (1.3) now enable us to write down explicit formulas for the
quantities $p_{jk}^{(n)}$. The final result can be written in matrix form

$$P^n = \frac{1}{\alpha+p}\begin{pmatrix} \alpha & p \\ \alpha & p \end{pmatrix} + \frac{(1-\alpha-p)^n}{\alpha+p}\begin{pmatrix} p & -p \\ -\alpha & \alpha \end{pmatrix}$$

(where factors common to all four elements have been taken out as
factors to the matrices). Since $|1-\alpha-p| < 1$, the second matrix
tends to zero as $n \to \infty$, and the first matrix represents the limiting form
of $P^n$.
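
The closed form for $P^n$ in example (a) is easy to check numerically. The sketch below is only an illustration; the function names are invented here, and the comparison is made against a brute-force matrix product in exact arithmetic.

```python
from fractions import Fraction as F

def two_state_power(p, alpha, n):
    """Closed form of P^n for P = [[1 - p, p], [alpha, 1 - alpha]]."""
    t = alpha + p
    d = (1 - alpha - p) ** n          # factor multiplying the second matrix
    return [[(alpha + d * p) / t, (p - d * p) / t],
            [(alpha - d * alpha) / t, (p + d * alpha) / t]]

def brute_force_power(p, alpha, n):
    """P^n by repeated multiplication (n >= 1), for comparison only."""
    P = [[1 - p, p], [alpha, 1 - alpha]]
    R = P
    for _ in range(n - 1):
        R = [[sum(R[i][m] * P[m][k] for m in range(2)) for k in range(2)]
             for i in range(2)]
    return R
```

For large `n` both rows of `two_state_power(p, alpha, n)` approach $(\alpha/(\alpha+p),\ p/(\alpha+p))$, the limiting form given by the first matrix.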

(b) Let

$$P = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 1/2 & 1/2 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$

[this is the matrix of problem 1(b), Chapter 15]. The system (1.7a)
reduces to

(2.2)    $$x_1 = sx_4, \qquad x_2 = sx_4, \qquad x_3 = \frac{s(x_1 + x_2)}{2}, \qquad x_4 = sx_3.$$

Since a multiplicative constant remains arbitrary, we may put $x_4 = 1$.
Then $x_1 = s$, $x_2 = s$, $x_3 = s^2$, $x_4 = s^3$, and therefore we must have
$s^3 = 1$. Now if we put

(2.3)    $$\theta = e^{2\pi i/3} = \cos\frac{2\pi}{3} + i \sin\frac{2\pi}{3},$$

then the three roots of $s^3 = 1$ are $s_1 = 1$, $s_2 = \theta$, $s_3 = \theta^2$. (Note that
we have only three roots, even though there are four states.) The
solutions $x_j^{(r)}$ corresponding to the three roots are $(1, 1, 1, 1)$,
$(\theta, \theta, \theta^2, 1)$, $(\theta^2, \theta^2, \theta, 1)$.

From (1.7b) we get $y_1 = sy_3/2$, $y_2 = sy_3/2$, $y_3 = sy_4$, $y_4 = s(y_1 + y_2)$.
The three sets of solutions corresponding to $s_1 = 1$, $s_2 = \theta$, and $s_3 = \theta^2$
are $(1, 1, 2, 2)$, $(\theta, \theta, 2, 2\theta^2)$, $(\theta^2, \theta^2, 2, 2\theta)$. Therefore from (1.11)
$c_1 = 1/6$, $c_2 = 1/(6\theta^2) = \theta/6$, $c_3 = 1/(6\theta) = \theta^2/6$. We are now able
to express all $p_{jk}^{(n)}$. For example,

(2.4)    $$p_{11}^{(n)} = p_{22}^{(n)} = \frac{1 + \theta^{n} + \theta^{2n}}{6}, \qquad p_{13}^{(n)} = \frac{1 + \theta^{2n+2} + \theta^{n+1}}{3}, \qquad p_{14}^{(n)} = \frac{1 + \theta^{2n+1} + \theta^{n+2}}{3},$$

etc. The chain is obviously periodic with period 3.
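
The formulas (2.4) can be checked numerically. The following sketch assumes the matrix whose rows are implied by the system (2.2), namely $p_{14} = p_{24} = 1$, $p_{31} = p_{32} = 1/2$, $p_{43} = 1$; the helper names are invented for this illustration.

```python
import cmath

theta = cmath.exp(complex(0, 2 * cmath.pi / 3))    # the number defined in (2.3)

# Transition matrix as implied by the system (2.2) (an assumption of this sketch).
P = [[0, 0, 0, 1],
     [0, 0, 0, 1],
     [0.5, 0.5, 0, 0],
     [0, 0, 1, 0]]

def p11(n):
    return ((1 + theta ** n + theta ** (2 * n)) / 6).real

def p13(n):
    return ((1 + theta ** (2 * n + 2) + theta ** (n + 1)) / 3).real

def p14(n):
    return ((1 + theta ** (2 * n + 1) + theta ** (n + 2)) / 3).real

def mat_pow(P, n):
    """Direct n-fold product (n >= 1), used as an independent check."""
    R = P
    for _ in range(n - 1):
        R = [[sum(R[i][m] * P[m][k] for m in range(4)) for k in range(4)]
             for i in range(4)]
    return R
```

Each of the three sequences takes a non-zero value only for one residue of `n` modulo 3, which is the periodicity noted in the text.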
(c) Let $p + q = 1$, and

(2.5)    $$P = \begin{pmatrix} 0 & p & 0 & q \\ q & 0 & p & 0 \\ 0 & q & 0 & p \\ p & 0 & q & 0 \end{pmatrix}.$$

(This matrix describes a cyclical random walk; cf. example V of
Chapter 15.) The equations (1.7a) reduce to $x_1 = s(px_2 + qx_4)$,
$x_2 = s(qx_1 + px_3)$, $x_3 = s(qx_2 + px_4)$, $x_4 = s(px_1 + qx_3)$. Suppose
that $p \neq q$. From the first and the third equations we find $x_1 + x_3
= s(x_2 + x_4)$, and from the remaining equations $x_2 + x_4 = s(x_1 + x_3)$.
Hence we have either $s^2 = 1$ or $x_1 + x_3 = x_2 + x_4 = 0$. The first
alternative leads to the two roots $s_1 = 1$, $s_2 = -1$. On the other hand,
substituting $x_3 = -x_1$, $x_4 = -x_2$ into the first two equations, we find
$s^2(p - q)^2 = -1$, which yields the remaining two roots $s_3$ and $s_4$. Thus

(2.6)    $$s_1 = 1, \qquad s_2 = -1, \qquad s_3 = \frac{i}{q-p}, \qquad s_4 = \frac{-i}{q-p}$$

(where $i^2 = -1$). The corresponding solutions $x_j^{(r)}$ contain an
arbitrary factor, and we are free to put $x_4^{(r)} = 1$. Then the four
sets of solutions are easily found to be $(1, 1, 1, 1)$, $(-1, 1, -1, 1)$,
$(i, -1, -i, 1)$, $(-i, -1, i, 1)$. The system (1.7b) reduces in our
case to $y_1 = s(qy_2 + py_4)$, $y_2 = s(py_1 + qy_3)$, $y_3 = s(py_2 + qy_4)$,
$y_4 = s(qy_1 + py_3)$. To the four roots (2.6) there correspond the
solutions $(1, 1, 1, 1)$, $(-1, 1, -1, 1)$, $(-i, -1, i, 1)$, $(i, -1, -i, 1)$. For
the constants $c_r$ we find from (1.11) $c_1 = c_2 = c_3 = c_4 = 1/4$. Using
(1.3) and (1.8), we can now write an explicit formula for each sequence
$p_{jk}^{(n)}$ ($n = 1, 2, 3, \ldots$). In the present case the solutions $x_j^{(r)}$ and
$y_j^{(r)}$ are of the simple form $(\alpha, \alpha^2, \alpha^3, \alpha^4)$, where $\alpha$ is one of the four
numbers $1$, $-1$, $i$, or $-i$. This enables us to express the $p_{jk}^{(n)}$ by the
single formula

(2.7)    $$p_{jk}^{(n)} = \tfrac{1}{4}\{1 + (q-p)^n\, i^{\,j-k-n}\}\{1 + (-1)^{j-k-n}\}.$$

This formula is valid also for $p = q = 1/2$.

It is seen that the term involving $(q - p)^n$ tends to zero, and that
the other term has period 2.
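
A short numerical check of the single formula for the four-state cyclical walk may be helpful. The sketch below writes the formula as $\tfrac14\{1 + (q-p)^n i^{\,j-k-n}\}\{1 + (-1)^{j-k-n}\}$, which is what (1.3), (1.8) and the roots (2.6) give; the function names are invented here and states are numbered 1 to 4.

```python
def cyclic4_power(p, j, k, n):
    """p_jk^(n) for the matrix (2.5), states numbered 1..4."""
    q = 1 - p
    e = j - k - n
    i = complex(0, 1)
    return ((1 + (q - p) ** n * i ** e) * (1 + (-1) ** e) / 4).real

def cyclic4_matrix(p):
    q = 1 - p
    return [[0, p, 0, q],
            [q, 0, p, 0],
            [0, q, 0, p],
            [p, 0, q, 0]]

def mat_pow(P, n):
    """Direct n-fold product (n >= 1), used as an independent check."""
    R = P
    for _ in range(n - 1):
        R = [[sum(R[r][m] * P[m][c] for m in range(4)) for c in range(4)]
             for r in range(4)]
    return R
```

With `p = 0.5` the factor $(q-p)^n$ vanishes and the formula reduces to the purely periodic term.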

(d) General Cyclical Random Walk (example V of Chapter 15). In
the preceding example we were able to express the $x_j^{(r)}$ and $y_k^{(r)}$ as
powers of the four fourth roots of unity. This suggests trying a similar
procedure for the general matrix of example V. It is convenient to
number the states from 0 to $a - 1$. For brevity we put

(2.8)    $$\theta = e^{2\pi i/a}.$$

This is an $a$th root of unity, and all $a$th roots are represented by the
sequence $1, \theta, \theta^2, \ldots, \theta^{a-1}$. It is easily seen that the systems (1.7a)
and (1.7b) are satisfied by the $a$ sets of solutions

(2.9)    $$x_j^{(r)} = \theta^{rj}, \qquad y_k^{(r)} = \theta^{-rk}$$

with $r = 0, 1, 2, \ldots, a - 1$; they correspond to the roots

(2.10)    $$s_r^{-1} = \sum_{\nu=0}^{a-1} q_\nu\, \theta^{\nu r}.$$

From (1.11) and (2.9) we find $c_r = 1/a$ for all $r$, and thus finally

(2.11)    $$p_{jk}^{(n)} = \frac{1}{a} \sum_{r=0}^{a-1} \theta^{r(j-k)} \left(\sum_{\nu=0}^{a-1} q_\nu\, \theta^{\nu r}\right)^{n}.$$

It is interesting to verify this formula for $n = 1$. The factor of $q_\nu$ is

(2.12)    $$\frac{1}{a}\sum_{r=0}^{a-1} \theta^{r(j-k+\nu)}.$$

The sum is zero except when $j - k + \nu = 0$ or $a$, in which case each
term equals one. Hence $p_{jk}^{(1)}$ reduces to $q_{k-j}$ if $k \ge j$ and to $q_{a+k-j}$ if
$k < j$, and this is the given matrix $(p_{jk})$.
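
Formula (2.11) lends itself to a direct numerical check. The sketch below assumes, as in the verification for $n = 1$, that the one-step probabilities are $p_{jk} = q_{k-j}$ with the index taken modulo $a$; the function names and the sample step distribution are invented for illustration.

```python
import cmath

def cyclic_power(q, j, k, n):
    """p_jk^(n) from (2.11); q[v] is the probability of a step of v places."""
    a = len(q)
    theta = cmath.exp(complex(0, 2 * cmath.pi / a))        # (2.8)
    total = 0
    for r in range(a):
        inv_root = sum(q[v] * theta ** (v * r) for v in range(a))   # 1/s_r
        total += theta ** (r * (j - k)) * inv_root ** n
    return (total / a).real

def cyclic_matrix(q):
    """One-step matrix with p_jk = q[(k - j) mod a], states 0..a-1."""
    a = len(q)
    return [[q[(k - j) % a] for k in range(a)] for j in range(a)]

def mat_pow(P, n):
    """Direct n-fold product (n >= 1), used as an independent check."""
    size = len(P)
    R = P
    for _ in range(n - 1):
        R = [[sum(R[r][m] * P[m][c] for m in range(size)) for c in range(size)]
             for r in range(size)]
    return R
```

Only the $a$ numbers $1/s_r$ have to be computed, after which every $p_{jk}^{(n)}$ is a finite sum of $a$ terms.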

(e) The Occupancy Problem. Example VIII of Chapter 15 shows that
the classical occupancy problem can be treated by the method of
Markov chains. The system is in state $j$ if there are $j$ occupied and
$a - j$ empty cells. If this is the initial situation and $n$ additional balls
are placed at random, then $p_{jk}^{(n)}$ is the probability that there will be $k$
occupied and $a - k$ empty cells (so that $p_{jk}^{(n)} = 0$ if $k < j$). For
$j = 0$ this probability follows from formula (5.4) of Chapter 4. We
now derive a formula for $p_{jk}^{(n)}$, thus generalizing the result of
Chapter 4.

Since $p_{jj} = j/a$ and $p_{j,j+1} = (a - j)/a$, it is easily seen that the
system of equations (1.7a) reduces to

(2.13)    $$(a - sj)\,x_j = s(a - j)\,x_{j+1}, \qquad j = 0, \ldots, a.$$

For $s = 1$ we get the solution $x_j = 1$. It is clear that if $s \neq 1$ then
$x_a = 0$, so that $s = 1$ is the only value of $s$ for which all $x_j$ are different
from zero. If $s$ is any other value for which (2.13) has a solution,
then there must exist some index $r < a$ such that $x_{r+1} = 0$ but $x_r \neq 0$;
from (2.13) it then follows that $sr = a$. Thus the roots $s_r$ for which
(2.13) has solutions are $s_r = a/r$ with $r = 1, 2, \ldots, a$. The
corresponding solutions of (2.13) are obtained successively, putting
$x_0^{(r)} = 1$, and $j = 0, 1, \ldots$. We find

(2.14)    $$x_j^{(r)} = \binom{r}{j} \Big/ \binom{a}{j},$$

so that $x_j^{(r)} = 0$ when $j > r$.

For $s = s_r$ the system (1.7b) reduces to

(2.15)    $$(r - j)\,y_j^{(r)} = (a - j + 1)\,y_{j-1}^{(r)}$$

and has the solution

(2.16)    $$y_j^{(r)} = \binom{a-r}{j-r}(-1)^{j-r},$$

where, of course, $y_j^{(r)} = 0$ if $j < r$. Since $x_j^{(r)} = 0$ for $j > r$ and
$y_j^{(r)} = 0$ for $j < r$, we find from (1.11) easily $c_r = (x_r^{(r)} y_r^{(r)})^{-1} = \binom{a}{r}$,
and hence

(2.17)    $$p_{jk}^{(n)} = \sum_{r=j}^{k} \left(\frac{r}{a}\right)^{n} \binom{a}{r}\binom{r}{j}\binom{a}{j}^{-1}\binom{a-r}{k-r}(-1)^{k-r}.$$

On expressing the binomial coefficients in terms of factorials, this
formula simplifies to

(2.18)    $$p_{jk}^{(n)} = \binom{a-j}{a-k} \sum_{\nu=0}^{k-j} \left(\frac{\nu+j}{a}\right)^{n} (-1)^{k-j-\nu} \binom{k-j}{\nu},$$

with $p_{jk}^{(n)} = 0$ if $k < j$.
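
As a check on (2.18), the following sketch compares the formula with a direct power of the occupancy matrix ($p_{jj} = j/a$, $p_{j,j+1} = (a-j)/a$) in exact arithmetic. The function names are invented here; the formula is taken in the form $\binom{a-j}{a-k}\sum_{\nu=0}^{k-j}\left(\frac{\nu+j}{a}\right)^n(-1)^{k-j-\nu}\binom{k-j}{\nu}$.

```python
from fractions import Fraction as F
from math import comb

def occupancy(a, j, k, n):
    """p_jk^(n): k cells occupied after n further balls, starting with j occupied."""
    if k < j:
        return F(0)
    return comb(a - j, a - k) * sum(
        F(v + j, a) ** n * (-1) ** (k - j - v) * comb(k - j, v)
        for v in range(k - j + 1))

def occupancy_matrix(a):
    """One-step matrix on the states 0, 1, ..., a."""
    P = [[F(0)] * (a + 1) for _ in range(a + 1)]
    for j in range(a + 1):
        P[j][j] = F(j, a)
        if j < a:
            P[j][j + 1] = F(a - j, a)
    return P

def mat_pow(P, n):
    """Direct n-fold product (n >= 1), used as an independent check."""
    size = len(P)
    R = P
    for _ in range(n - 1):
        R = [[sum(R[r][m] * P[m][c] for m in range(size)) for c in range(size)]
             for r in range(size)]
    return R
```

For instance, `occupancy(a, 0, a, n)` is the probability that `n` balls placed at random leave none of the `a` cells empty.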

(Further examples are found in the following two sections.) 


3. Random Walk with Reflecting Barriers 

The application of Markov chains 4 will now be illustrated by a
complete discussion of example IV of Chapter 15. In the terminology
of random walks, $p_{jk}^{(n)}$ is the probability that the particle which starts
from $x = j$ is at time $n$ at $x = k$.

4 Part of what follows is a repetition of the theory of Chapter 14. Our quadratic
equation occurs there as (4.7); the quantities $\lambda_1(s)$ and $\lambda_2(s)$ of the text were given
in (4.8), and the general solution (3.3) appears in Chapter 14 as (4.9). The two
methods are related, but in many cases the computational details will differ radically.

The equations (1.7a) take on the form

(3.1)    $$\begin{aligned} x_1 &= s(qx_1 + px_2) \\ x_j &= s(qx_{j-1} + px_{j+1}) \qquad (j = 2, 3, \ldots, a-1) \\ x_a &= s(qx_{a-1} + px_a). \end{aligned}$$

This system admits the solution $x_j = 1$ corresponding to the root
$s = 1$. To find all other solutions we apply the method of particular
solutions (which we have used for similar equations in Chapter 14,
section 4). The middle equation in (3.1) is satisfied by $x_j = \lambda^j$
provided that $\lambda$ is a root of the quadratic equation $\lambda = qs + \lambda^2 ps$. The
two roots of this equation are

(3.2)    $$\lambda_1(s) = \frac{1 + (1 - 4pqs^2)^{1/2}}{2ps}, \qquad \lambda_2(s) = \frac{1 - (1 - 4pqs^2)^{1/2}}{2ps},$$

and the most general solution of the middle equation in (3.1) is therefore

(3.3)    $$x_j = A(s)\lambda_1^{\,j}(s) + B(s)\lambda_2^{\,j}(s),$$

where $A(s)$ and $B(s)$ are arbitrary. The first and the last equation
in (3.1) will be satisfied by (3.3) if and only if $x_0 = x_1$ and $x_a = x_{a+1}$.
This requires that $A(s)$ and $B(s)$ satisfy the conditions

(3.4)    $$\begin{aligned} A(s)\{1 - \lambda_1(s)\} + B(s)\{1 - \lambda_2(s)\} &= 0 \\ A(s)\lambda_1^{\,a}(s)\{1 - \lambda_1(s)\} + B(s)\lambda_2^{\,a}(s)\{1 - \lambda_2(s)\} &= 0. \end{aligned}$$

However, these two equations are compatible only if

(3.5)    $$\lambda_1^{\,a}(s) = \lambda_2^{\,a}(s),$$

and we have to determine the values of $s$ for which (3.5) is possible.

From the definition (3.2) we have $\lambda_1(s)\lambda_2(s) = q/p$, and (3.5) implies
that $\lambda_1(s)(p/q)^{1/2}$ and $\lambda_2(s)(p/q)^{1/2}$ must be $(2a)$th roots of unity. These
roots can be written in the form

(3.6)    $$e^{i\pi r/a} = \cos\frac{\pi r}{a} + i \sin\frac{\pi r}{a},$$

where $i^2 = -1$ and $r = 0, 1, 2, \ldots, 2a - 1$. Thus all solutions of
(3.5) are among

(3.7)    $$\lambda_1(s) = (q/p)^{1/2} e^{i\pi r/a}, \qquad \lambda_2(s) = (q/p)^{1/2} e^{-i\pi r/a}.$$

The value $r = a$ must be disregarded, since for it $\lambda_1(s) = \lambda_2(s)$,
$A(s) = -B(s)$, so that it leads only to the trivial solution $x_j = 0$.
To $r = 0$ there corresponds the solution $x_j = 1$, which we have already
considered. To $r = 1, 2, \ldots, a - 1$ there correspond $a - 1$ distinct
solutions; if we let $r = a + 1, a + 2, \ldots, 2a - 1$, we get the same
solutions with $\lambda_1(s)$ and $\lambda_2(s)$ interchanged. Thus we have found $a$
distinct sets of solutions of (3.1), and we know that there can be no
more. To each value $r$ we can find the root $s_r$ from (3.7); it is

$$s_r = \{2(pq)^{1/2}\cos \pi r/a\}^{-1}.$$

For $r = 1, 2, \ldots, a - 1$ we get the values of $\lambda_1(s)$ and $\lambda_2(s)$ from
(3.7), and then from (3.4) $A(s) = 1 - \lambda_2(s)$ and $B(s) = -\{1 - \lambda_1(s)\}$.
(Remember that a multiplicative constant remains arbitrary.)
Substituting into (3.3), we find the $a - 1$ sets of solutions

(3.8)    $$x_j^{(r)} = \left(\frac{q}{p}\right)^{j/2} \sin\frac{\pi r j}{a} - \left(\frac{q}{p}\right)^{(j+1)/2} \sin\frac{\pi r (j-1)}{a}$$

($r = 1, 2, \ldots, a - 1$). To this we add the solution previously found

(3.9)    $$x_j^{(0)} = 1.$$

It is easy to verify that (3.8) and (3.9) represent solutions of the
given system (3.1).

We have now to find solutions of the second system of linear
equations. In the present case (1.7b) takes on the form

(3.10)    $$\begin{aligned} y_1 &= sq(y_1 + y_2), \\ y_k &= s(py_{k-1} + qy_{k+1}), \qquad (k = 2, \ldots, a-1) \\ y_a &= sp(y_{a-1} + y_a). \end{aligned}$$

The middle equation is the same as (3.1) with $p$ and $q$ interchanged,
and its general solution is therefore obtained from (3.3) simply by
interchanging $p$ and $q$. The first and the last equations can be
satisfied if (3.7) holds, and a simple calculation shows that for $r = 1,
2, \ldots, a - 1$ the solution of (3.10) is

(3.11)    $$y_k^{(r)} = \left(\frac{p}{q}\right)^{k/2} \sin\frac{\pi r k}{a} - \left(\frac{p}{q}\right)^{(k-1)/2} \sin\frac{\pi r (k-1)}{a}.$$

For $s = 1$ we find similarly

(3.12)    $$y_k^{(0)} = \left(\frac{p}{q}\right)^{k}.$$

The next step consists in evaluating the coefficients $c_r$ in (1.11).
The sum simplifies if $\sin^2 \pi r j/a$ is expressed in terms of the cosine of
the double angle, and this in turn by means of complex exponentials.
Then we have only to sum finite geometric series and find easily

(3.13)    $$c_r = \frac{2p}{a}\left[1 - 2(pq)^{1/2} \cos\frac{\pi r}{a}\right]^{-1} \qquad (r = 1, 2, \ldots, a-1).$$

For $r = 0$ we get

(3.14)    $$c_0 = \frac{q}{p} \cdot \frac{(p/q) - 1}{(p/q)^a - 1},$$

provided that $p \neq q$. If $p = q = 1/2$, then (3.13) remains valid,
but (3.14) is to be replaced by $c_0 = 1/a$.

These formulas lead to the final result

(3.15)    $$p_{jk}^{(n)} = \left(\frac{p}{q}\right)^{k-1} \frac{(p/q) - 1}{(p/q)^a - 1} + \frac{2^{n+1}}{a}\, p^{(n-j+k)/2+1}\, q^{(n+j-k)/2} \sum_{r=1}^{a-1} S_r,$$

where $S_r$ stands for

$$S_r = \frac{\cos^n \dfrac{\pi r}{a} \left\{\sin\dfrac{\pi r j}{a} - \left(\dfrac{q}{p}\right)^{1/2} \sin\dfrac{\pi r (j-1)}{a}\right\} \left\{\sin\dfrac{\pi r k}{a} - \left(\dfrac{q}{p}\right)^{1/2} \sin\dfrac{\pi r (k-1)}{a}\right\}}{1 - 2(pq)^{1/2} \cos\dfrac{\pi r}{a}}.$$

As $n \to \infty$, the second term in (3.15) tends to zero, and we find again
that $p_{jk}^{(n)}$ tends to a stationary distribution independent of $j$. [This
limiting distribution was derived by other methods in Chapter 15,
example (6.b).] Passing to the limit $a \to \infty$, we get the formula for a
random walk with a single reflecting barrier; in the limit, the sum is
replaced by an integral. 5
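
The final formula of this section can be tested numerically. The sketch below evaluates $p_{jk}^{(n)}$ as the stationary term plus the sum over $r = 1, \ldots, a-1$ of $c_r x_j^{(r)} y_k^{(r)} s_r^{-n}$, with $c_r$, $x_j^{(r)}$, $y_k^{(r)}$ and $s_r$ as given in (3.8), (3.11), (3.13), and compares it with a direct power of the matrix of (3.1). The function names are invented here, and $p \neq q$ is assumed so that (3.14) applies.

```python
from math import cos, sin, pi, sqrt

def reflecting_power(p, a, j, k, n):
    """p_jk^(n) for the walk with reflecting barriers; states 1..a, p != q."""
    q = 1 - p
    ratio = p / q
    stationary = ratio ** (k - 1) * (ratio - 1) / (ratio ** a - 1)
    total = 0.0
    for r in range(1, a):
        phi = pi * r / a
        xj = sin(phi * j) - sqrt(q / p) * sin(phi * (j - 1))
        yk = sin(phi * k) - sqrt(q / p) * sin(phi * (k - 1))
        total += cos(phi) ** n * xj * yk / (1 - 2 * sqrt(p * q) * cos(phi))
    factor = 2 ** (n + 1) * p ** ((n - j + k) / 2 + 1) * q ** ((n + j - k) / 2) / a
    return stationary + factor * total

def reflecting_matrix(p, a):
    """One-step matrix of (3.1): the particle stays put when a barrier blocks it."""
    q = 1 - p
    P = [[0.0] * a for _ in range(a)]
    for j in range(a):
        P[j][max(j - 1, 0)] += q
        P[j][min(j + 1, a - 1)] += p
    return P

def mat_pow(P, n):
    """Direct n-fold product (n >= 1), used as an independent check."""
    size = len(P)
    R = P
    for _ in range(n - 1):
        R = [[sum(R[r][m] * P[m][c] for m in range(size)) for c in range(size)]
             for r in range(size)]
    return R
```

For large `n` only the first term survives, which is the stationary distribution proportional to $(p/q)^k$.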

4. Transient States; Absorption Probabilities 

The theorem of section 1 was derived under the assumption that the
roots $s_1, s_2, \ldots$ are distinct. The presence of multiple roots does not
require essential modifications, but we shall discuss only a particular
case of special importance. The root $s_1 = 1$ is multiple whenever the
chain contains two or more closed subchains, and this is a frequent
situation in problems connected with absorption probabilities. It is
easy to adapt the method of section 1 to this case. For conciseness and
clarity, we shall explain the procedure by means of examples which
will reveal the main features of the general case.

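Examples. (a) Consider the matrix of transition probabilities

(4.1)    $$P = \begin{pmatrix} 1/3 & 2/3 & 0 & 0 & 0 & 0 \\ 2/3 & 1/3 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1/4 & 3/4 & 0 & 0 \\ 0 & 0 & 1/5 & 4/5 & 0 & 0 \\ 1/4 & 0 & 1/4 & 0 & 1/4 & 1/4 \\ 1/6 & 1/6 & 1/6 & 1/6 & 1/6 & 1/6 \end{pmatrix}.$$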

It is clear that $E_1$ and $E_2$ form a closed set (that is, no transition is
possible to any of the remaining four states; cf. Chapter 15, section 4).
Similarly $E_3$ and $E_4$ form another closed set. Finally, $E_5$ and $E_6$ are
transient states. After finitely many steps the system passes into one
of the two closed sets and remains there.

5 For analogous formulas in the case of one reflecting and one absorbing barrier
cf. M. Kac, Random Walk and the Theory of Brownian Motion, American
Mathematical Monthly, vol. 54 (1947), pp. 369-391. The definition of the reflecting barrier
is there modified so that the particle may reach $x = 0$; whenever this occurs, the
next step takes it to $x = 1$. The explicit formulas are then more complicated.
Kac also found formulas for $p_{jk}^{(n)}$ in the Ehrenfest model (example VII of
Chapter 15).

The matrix $P$ has the form of a partitioned matrix

(4.2)    $$P = \begin{pmatrix} A & 0 & 0 \\ 0 & B & 0 \\ U & V & T \end{pmatrix}$$

where each letter stands for a two by two matrix and each zero for a
matrix with four zeros. For example, $A$ has the rows (1/3, 2/3) and
(2/3, 1/3); this is the matrix of transition probabilities corresponding
to the chain formed by the two states $E_1$ and $E_2$. This matrix can be
studied by itself, and the powers $A^n$ can be obtained from example
(2.a) with $p = \alpha = 2/3$. When the powers $P^2, P^3, \ldots$ are calculated,
it will be found that the first two rows are in no way affected by the
remaining four rows. More precisely, $P^n$ has the form

(4.3)    $$P^n = \begin{pmatrix} A^n & 0 & 0 \\ 0 & B^n & 0 \\ U_n & V_n & T^n \end{pmatrix}$$

where $A^n$, $B^n$, $T^n$ are the $n$th powers of $A$, $B$, and $T$, respectively, and
can be calculated 6 by the method of section 1 [cf. example (2.a) where
all calculations are performed]. Instead of six equations with six
unknowns we are confronted only with systems of two equations with
two unknowns each.

It should be noted that the matrices $U_n$ and $V_n$ in (4.3) are not
powers of $U$ and $V$ and cannot be obtained in the same simple way as
$A^n$, $B^n$, and $T^n$. However, in the calculation of $P^2, P^3, \ldots$ the third
and fourth columns never affect the remaining four columns. In other
words, if in $P^n$ the rows and columns corresponding to $E_3$ and $E_4$ are
deleted, we get the matrix

(4.4)    $$\begin{pmatrix} A^n & 0 \\ U_n & T^n \end{pmatrix}$$

which is the $n$th power of the corresponding submatrix in $P$, that is, of

(4.5)    $$\begin{pmatrix} 1/3 & 2/3 & 0 & 0 \\ 2/3 & 1/3 & 0 & 0 \\ 1/4 & 0 & 1/4 & 1/4 \\ 1/6 & 1/6 & 1/6 & 1/6 \end{pmatrix}.$$

Therefore (4.4) can be calculated by the method of section 1, which
in the present case simplifies considerably. The matrix $V_n$ can be
obtained in a similar way.

6 In $T$ the rows do not add to unity so that $T$ is not a stochastic matrix.
However, the method of section 1 applies without change, except that $s = 1$ is no longer
a root (so that $T^n \to 0$).

Usually the explicit forms of $U_n$ and $V_n$ are of interest only inasmuch
as they are connected with absorption probabilities. If the system
starts from, say, $E_5$, what is the probability $\lambda$ that it will eventually pass
into the closed set formed by $E_1$ and $E_2$ (and not into the other closed
set)? What is the probability $\lambda_n$ that this will occur exactly at the $n$th
step? Clearly $p_{51}^{(n)} + p_{52}^{(n)}$ is the probability that the considered
event occurs at the $n$th step or before, that is,

$$p_{51}^{(n)} + p_{52}^{(n)} = \lambda_1 + \lambda_2 + \cdots + \lambda_n.$$

Letting $n \to \infty$, we get $\lambda$. A preferable way to calculate $\lambda_n$ is as
follows. The $(n-1)$st step must take the system to a state other than
$E_1$ and $E_2$, that is, to either $E_5$ or $E_6$ (since from $E_3$ or $E_4$ no
transition to $E_1$ and $E_2$ is possible). The $n$th step then takes the system to
$E_1$ or $E_2$. Hence

(4.6)    $$\lambda_n = p_{55}^{(n-1)}(p_{51} + p_{52}) + p_{56}^{(n-1)}(p_{61} + p_{62}) = \frac{1}{4}\,p_{55}^{(n-1)} + \frac{1}{3}\,p_{56}^{(n-1)}.$$

It will be noted that $\lambda_n$ is completely determined by the elements of
$T^{n-1}$, and this matrix is easily calculated. In the present case

$$p_{55}^{(n)} = p_{56}^{(n)} = \frac{1}{4}\left(\frac{5}{12}\right)^{n-1}$$

and hence

$$\lambda_n = \frac{7}{48}\left(\frac{5}{12}\right)^{n-2} \qquad (n \ge 2), \qquad \lambda_1 = \frac{1}{4}.$$
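
The calculation of $\lambda_n$ from $T^{n-1}$ can be sketched as follows. The two-by-two matrix `T` and the one-step absorption probabilities are read off the last two rows of (4.1); the function name `lambda_n` is invented for this illustration, and exact fractions are used.

```python
from fractions import Fraction as F

T = [[F(1, 4), F(1, 4)],       # transitions E5 -> E5, E6
     [F(1, 6), F(1, 6)]]       # transitions E6 -> E5, E6
into_first_set = [F(1, 4), F(1, 3)]   # p51 + p52 and p61 + p62

def lambda_n(n):
    """Probability that the system, starting at E5, enters {E1, E2} exactly at step n."""
    row = [F(1), F(0)]                       # the E5 row of T^0
    for _ in range(n - 1):                   # advance to the E5 row of T^(n-1)
        row = [row[0] * T[0][0] + row[1] * T[1][0],
               row[0] * T[0][1] + row[1] * T[1][1]]
    return row[0] * into_first_set[0] + row[1] * into_first_set[1]
```

Summing `lambda_n(n)` over `n` gives partial sums that approach $\lambda = 1/2$ for this matrix, so that a system starting from $E_5$ is equally likely to end in either closed set.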



(b) Brother-sister Mating. As a second example we give a complete
treatment of example XI of Chapter 15. A glance at the matrix
shows that the states $E_1$ and $E_5$ form a closed set each (a fact which
is clear from the biological meaning). If the system starts from any
other state $E_j$, it will eventually pass either into $E_1$ or into $E_5$ and
then remain there. The breeder desires to know the corresponding
probabilities and the expected duration of the process.



Deleting the first and fifth column and row, we get the reduced
matrix

(4.7)    $$T = \begin{pmatrix} 1/2 & 1/4 & 0 & 0 \\ 1/4 & 1/4 & 1/4 & 1/8 \\ 0 & 1/4 & 1/2 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix}.$$

The powers $T^n$ will now be calculated by the method of section 1.
They represent the transition probabilities among transient states.
The equations (1.7a) reduce to

(4.8)    $$x_1 = \frac{s(2x_1 + x_2)}{4}, \qquad x_2 = \frac{s(2x_1 + 2x_2 + 2x_3 + x_4)}{8}, \qquad x_3 = \frac{s(x_2 + 2x_3)}{4}, \qquad x_4 = sx_2.$$

It has a solution only if the determinant vanishes, and this condition
leads to a fourth-degree equation in $s$. To simplify writing we put

(4.9)    $$\theta_1 = 5^{1/2} - 1, \qquad \theta_2 = 5^{1/2} + 1.$$

Then the four roots $s_r$ are

(4.10)    $$s_1 = 2, \qquad s_2 = 4, \qquad s_3 = \theta_1, \qquad s_4 = -\theta_2,$$

and the corresponding solutions $(x_1^{(r)}, \ldots, x_4^{(r)})$ of (4.8) are

(4.11)    $$(1, 0, -1, 0), \quad (1, -1, 1, -4), \quad (1, \theta_1, 1, \theta_1^2), \quad (1, -\theta_2, 1, \theta_2^2).$$

The system of linear equations for $y_k^{(r)}$ is obtained by specialization
from (1.7b), and the four sets of solutions are in proper order

(4.12)    $$(1, 0, -1, 0), \quad (1, -1, 1, -1/2), \quad (1, \theta_1, 1, \theta_1^2/8), \quad (1, -\theta_2, 1, \theta_2^2/8).$$

From (1.11) we find the four constants $c_1 = 1/2$, $c_2 = 1/5$, $c_3 = \theta_2^2/40$,
$c_4 = \theta_1^2/40$. From (1.8) we get the $p_{jk}^{(r)}$; and finally (1.3) gives us
$p_{jk}^{(n)}$ for all transient states, that is, for $j, k = 2, 3, 4, 6$. For fixed $j$, $k$
the sequence $p_{jk}^{(n)}$ is the sum of four geometric series with ratios
$s_1^{-1}, \ldots, s_4^{-1}$.

An absorption in $E_1$ exactly at the $n$th step is possible only if the
$(n-1)$st step takes the system into either $E_2$ or $E_3$, and the $n$th step
into $E_1$. The probability for this is $p_{j2}^{(n-1)}/4 + p_{j3}^{(n-1)}/16$. Similarly,
the probability of absorption at $E_5$ is $p_{j3}^{(n-1)}/16 + p_{j4}^{(n-1)}/4$.
Summing over all $n$ we get the probabilities that the system will eventually
pass into and stay in $E_1$ and $E_5$, respectively. The actual calculation
of these probabilities requires only the summation of four geometric
series.
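
The summation of the four geometric series can be sketched numerically. The code below uses the roots (4.10) and the solutions (4.11), (4.12), ordered by the transient states $E_2, E_3, E_4, E_6$; it assumes that the expansion (1.3) may also be used for the zeroth power (the unit matrix), which holds here because there are four distinct roots. The function name `absorption` is invented for this illustration.

```python
from math import sqrt

t1, t2 = sqrt(5) - 1, sqrt(5) + 1                         # (4.9)
roots = [2, 4, t1, -t2]                                    # (4.10)
xs = [(1, 0, -1, 0), (1, -1, 1, -4),
      (1, t1, 1, t1 ** 2), (1, -t2, 1, t2 ** 2)]           # (4.11)
ys = [(1, 0, -1, 0), (1, -1, 1, -0.5),
      (1, t1, 1, t1 ** 2 / 8), (1, -t2, 1, t2 ** 2 / 8)]   # (4.12)

def absorption(one_step):
    """Sum over n >= 1 of sum_k p_jk^(n-1) * one_step[k], for each transient j."""
    probs = [0.0] * 4
    for s, x, y in zip(roots, xs, ys):
        c = 1 / sum(xv * yv for xv, yv in zip(x, y))       # c_r from (1.11)
        geometric = s / (s - 1)                            # sum of s^-(n-1) over n >= 1
        weight = sum(yk * b for yk, b in zip(y, one_step))
        for j in range(4):
            probs[j] += c * x[j] * weight * geometric
    return probs

into_E1 = absorption([1 / 4, 1 / 16, 0, 0])   # one-step probabilities of entering E1
into_E5 = absorption([0, 1 / 16, 1 / 4, 0])   # one-step probabilities of entering E5
```

Under these assumptions `into_E1` comes out as 3/4, 1/2, 1/4, 1/2 for the starting states $E_2, E_3, E_4, E_6$, and `into_E5` as the complementary values.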

5. Application to Recurrence Times 

In problem 6 of Chapter 12 it is shown how the mean $\mu$ and the
variance $\sigma^2$ of the recurrence time of a recurrent event $\mathcal{E}$ can be
calculated in terms of the probabilities $u_n$ that $\mathcal{E}$ occurs at the $n$th trial.
If $\mathcal{E}$ is not periodic, then

(5.1)    $$u_n \to \frac{1}{\mu} \qquad \text{and} \qquad \sum_{n=0}^{\infty}\left(u_n - \frac{1}{\mu}\right) = \frac{\sigma^2 - \mu + \mu^2}{2\mu^2},$$

provided that $\sigma^2$ is finite.

If we identify $\mathcal{E}$ with a recurrent state $E_j$, then $u_n = p_{jj}^{(n)}$ (and
$u_0 = 1$). In a finite Markov chain all recurrence times have finite
variance (cf. problem 8 of Chapter 15), so that (5.1) applies. Suppose
that $E_j$ is not periodic and that formula (1.3) applies. Then $s_1 = 1$
and $|s_r| > 1$ for $r = 2, 3, \ldots$, so that $p_{jj}^{(n)} \to p_{jj}^{(1)} = 1/\mu_j$, where $p_{jj}^{(1)}$
denotes the coefficient belonging to the root $s_1$ in (1.3). To the
term $u_n - 1/\mu$ of (5.1) there corresponds

(5.2)    $$p_{jj}^{(n)} - \frac{1}{\mu_j} = \sum_{r=2}^{a} p_{jj}^{(r)}\, s_r^{-n}.$$

This formula is valid for $n \ge 1$; summing the geometric series with
ratio $s_r^{-1}$, we find

(5.3)    $$\sum_{n=1}^{\infty}\left(p_{jj}^{(n)} - \frac{1}{\mu_j}\right) = \sum_{r=2}^{a} \frac{p_{jj}^{(r)}}{s_r - 1}.$$

Introducing this into (5.1), where the term with $n = 0$ equals $1 - 1/\mu_j$,
we find that if $E_j$ is a non-periodic recurrent
state, then its mean recurrence time is given by $\mu_j = 1/p_{jj}^{(1)}$, and the
variance of its recurrence time is

(5.4)    $$\sigma_j^2 = \mu_j^2 - \mu_j + 2\mu_j^2 \sum_{r=2}^{a} \frac{p_{jj}^{(r)}}{s_r - 1},$$

provided, of course, that formula (1.3) is applicable and $s_1 = 1$. The
case of periodic recurrent events and the occurrence of double roots
require only obvious modifications.
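
A small check of the recurrence-time formulas may be useful. The sketch below takes the two-state chain with rows $(1-p, p)$ and $(\alpha, 1-\alpha)$, for which the coefficients in (1.3) for the first state are $\alpha/(\alpha+p)$ (root $s_1 = 1$) and $p/(\alpha+p)$ (root $s_2 = 1/(1-\alpha-p)$). The variance is computed from the sum in (5.3) with the term $n = 0$ of (5.1), equal to $1 - 1/\mu$, added explicitly, and is compared with the moments of the recurrence time itself. The function names are invented here, and $\alpha + p \neq 1$ is assumed.

```python
from fractions import Fraction as F

def recurrence_moments(p, alpha):
    """Mean and variance of the recurrence time of the first state (spectral form)."""
    s2 = 1 / (1 - alpha - p)               # second root
    coeff1 = alpha / (alpha + p)           # coefficient for the root s_1 = 1
    coeff2 = p / (alpha + p)               # coefficient for the root s_2
    mu = 1 / coeff1
    tail = coeff2 / (s2 - 1)               # right side of (5.3)
    total = (1 - 1 / mu) + tail            # full sum in (5.1), including n = 0
    var = 2 * mu ** 2 * total + mu - mu ** 2
    return mu, var

def direct_moments(p, alpha, terms=2000):
    """Same moments from the distribution: T = 1 w.p. 1 - p, else geometric return."""
    probs = {1: 1 - p}
    for m in range(2, terms):
        probs[m] = p * (1 - alpha) ** (m - 2) * alpha
    mean = sum(m * w for m, w in probs.items())
    var = sum(m * m * w for m, w in probs.items()) - mean ** 2
    return mean, var
```

With `p = 1/3` and `alpha = 1/4` both routes give the mean 7/3 and the variance 68/9.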



CHAPTER 17 


THE SIMPLEST TIME-DEPENDENT STOCHASTIC 
PROCESSES 1 


1. General Orientation 

Random walks and Markov chains are examples of stochastic
processes 2 where changes can occur only at fixed times, say, $t = 1,
2, 3, \ldots$. On the other hand, in Chapter 6, sections 5 and 6, we were
concerned with phenomena such as telephone calls, radioactive
disintegrations, and chromosome breakages, where changes may occur
at any time. It is clear that a complete description of these processes
leads beyond the domain of discrete probabilities. To fix ideas,
consider the incoming calls at a telephone exchange (or, rather, an
idealized mathematical model of the actual process). Every instant $t$
corresponds to a trial, and the result of an experiment may be described
in terms of a function $X(t)$ giving the number of calls up to time $t$. If
the first call occurs at time $t_1$, the second at $t_2$, etc., then the function
$X(t)$ equals 0 for $0 \le t < t_1$, 1 for $t_1 \le t < t_2$, 2 for $t_2 \le t < t_3$, etc.
Conversely, every non-decreasing function $X(t)$, assuming only the
values 0, 1, 2, ..., represents a possible development at our telephone
exchange. In other words, a complete description of our conceptual
experiment calls for a sample space whose points are functions $X(t)$
(and not sequences as in the case of discrete trials). Further, we may
consider compound events such as "seven calls within a minute on a
certain day"; this is obviously the aggregate of those $X(t)$ which satisfy
the condition that for some point $t$ of a specified interval we have
$X(t + h) - X(t) \ge 7$, where $h$ represents the span of one minute.

We cannot deal here with such complicated sample spaces and must 
therefore defer the study of the more delicate aspects of the theory. 
Fortunately, certain interesting questions can be answered even with 
the simple means now at our disposal. 

If we limit the consideration to the number of calls X(t) within an 
arbitrary but fixed period of duration t, then X(t) is a random variable
of the familiar type, assuming the values 0, 1, 2, .... Let $P_n(t)$ be

1 This chapter is almost independent of Chapters 10-16. 

2 Cf. the footnote on p. 337. 




the probability that X(t) = n. It is true that the distribution 
$\{P_n(t)\}$ depends on the duration t, that is, on a continuous parameter.
However, most of the probability distributions already introduced 
depend on a parameter, and we are not in an essentially new situation. 

In Chapter 6, section 5, we used a limiting process to show that 
under certain conditions X(t) has a Poisson distribution, that is, 

(1.1)    $P_n(t) = e^{-\lambda t}\,\dfrac{(\lambda t)^n}{n!}.$

A second derivation was given in Chapter 11, section 4, in connection 
with the Pascal distribution. We now derive this result by a new 
method which is more flexible and can be applied to more complicated 
processes. We start by translating the physical or intuitive description 
of a process into properties required of the probabilities $P_n(t)$. In this
way we get a set of plausible and simple postulates on the distribution
$\{P_n(t)\}$, from which analytic expressions for $P_n(t)$ can be derived.

The artificial limitation to discrete probabilities has unavoidable 
drawbacks. Consider, for example, the zero term in (1.1). We 
interpret 

(1.2)    $P_0(t) = e^{-\lambda t}$

as the probability that no call occurs within an observation period of 
duration t. This formulation suggests that $P_0(t)$ also might be interpreted
as the probability that the waiting time (starting at an arbitrary
moment) up to the first call exceeds t. It can be shown that this
interpretation is correct, but it will be noticed that it involves probabilities
in a continuum. The operational meaning of our first formulation
is as follows: make a series of “identical observations” with a fixed
observational period t. Each trial results in either “no call” (success)
or “one or more calls” (failure). Then we have Bernoulli trials with
the probability of success $e^{-\lambda t}$. With the second interpretation we are
to wait until a call arrives. Every positive number is a possible waiting
time, so that the sample space corresponding to each trial is the
half-line t > 0. Formula (1.2) then represents a continuous probability
distribution and as such will be treated in volume 2 (opening a new 
approach to the Poisson distribution). 

2. The Poisson Process 

The simplest stochastic processes are of the type considered in 
Chapter 6, section 5. A system is subject to instantaneous changes of 
state which can occur at any time; these changes are due to the occurrence
of random events such as splitting of physical particles, arriving
of telephone calls, or breakage of a chromosome under harmful irradiation.
All changes are of the same kind, and we are concerned only
with their total number. Each change is marked by a point on the 
time axis, so that we are studying certain random distributions of 
points on a line. 

The physical processes which we have in mind are characterized by 
the two properties that they are stationary and that future changes are 
independent of past changes. By this we mean that the forces and 
influences which determine the process remain absolutely unchanged, 
so that the probability of any particular event is the same for all time 
intervals of length t, independent of where this interval is situated and
of the past history of the system. 3

We now translate this description into mathematical language. The
process is to be described in terms of probabilities 4 $P_n(t)$ that exactly n
changes occur during a time interval of length t. In particular, $P_0(t)$
is the probability of no change, and $1 - P_0(t)$ the probability of one
or more changes. We shall assume that 5 as $t \to 0$


(2.1)    $\dfrac{1 - P_0(t)}{t} \to \lambda,$


where $\lambda$ is a positive constant [the derivative of $-P_0(t)$ at t = 0].
Then for a small interval of length h the probability of one or more
changes is $1 - P_0(h) = \lambda h + o(h)$, where the term o(h) denotes a
quantity which is of smaller order of magnitude than h. We now
formulate our 

3 In a telephone exchange incoming calls are more frequent during the busiest
hour of the day than, say, between midnight and 1 a.m.; the process is therefore 
not homogeneous in time. However, for obvious reasons telephone engineers are 
concerned mainly with the “busy hour” of the day and for that period the process 
can be considered homogeneous. Experience shows also that during the busy hour 
the incoming traffic follows the Poisson distribution with surprising accuracy. 
Similar considerations apply to automobile accidents, which are more frequent on 
Sundays, etc. 

4 Our notation implies that these probabilities are independent of where the
interval of length t is taken. For a non-homogeneous process we should have to
introduce the probability $P_n(t_1, t_2)$ that n changes occur in the interval $t_1 < t < t_2$.

5 Instead of assuming (2.1) we may start from the following reasoning. The
event “no change in the time interval (0, t + s)” requires that no change occurs
within the two intervals (0, t) and (t, t + s). Because of the assumed independence
this leads to the equation $P_0(s + t) = P_0(s)P_0(t)$ whose only positive bounded
solution is $e^{-\lambda t}$. It follows that $P_0(t) = e^{-\lambda t}$, and this implies (2.1). However,
we prefer to start from (2.1), since this procedure leads in a more natural way to 
various generalizations. 





Postulates for the Poisson Process. Whatever the number of changes
during (0, t), the (conditional) probability that during (t, t + h) a change
occurs is $\lambda h + o(h)$, and the probability that more than one change occurs
is of smaller order of magnitude than h.

These conditions easily lead to a system of differential equations for
$P_n(t)$. Consider two contiguous intervals (0, t) and (t, t + h), where
h is small. If $n \ge 1$, then exactly n changes can occur in the interval
(0, t + h) in three mutually exclusive ways: (1) no change during
(t, t + h) and n changes during (0, t); (2) one change during (t, t + h)
and $n - 1$ changes during (0, t); (3) $x \ge 2$ changes during (t, t + h)
and $n - x$ changes during (0, t). According to our hypotheses, the
probability of the first contingency is $P_n(t)$ times the probability of no
change during (t, t + h), and this last is $1 - \lambda h - o(h)$. Similarly,
the second contingency has probability $P_{n-1}(t)\lambda h + o(h)$, while the
last has a probability of smaller order of magnitude than h. This
means that

(2.2)    $P_n(t + h) = P_n(t)(1 - \lambda h) + P_{n-1}(t)\lambda h + o(h)$


or

(2.3)    $\dfrac{P_n(t + h) - P_n(t)}{h} = -\lambda P_n(t) + \lambda P_{n-1}(t) + \dfrac{o(h)}{h}.$

As $h \to 0$, the last term tends to zero; hence the limit of the left side
exists and

(2.4)    $P_n'(t) = -\lambda P_n(t) + \lambda P_{n-1}(t)$    $(n \ge 1)$.

For n = 0 the second and third contingencies mentioned above do not
arise, and therefore (2.4) is to be replaced by the simpler equation

(2.5)    $P_0(t + h) = P_0(t)(1 - \lambda h) + o(h),$

which leads to

(2.6)    $P_0'(t) = -\lambda P_0(t).$

From (2.6) and $P_0(0) = 1$ we get $P_0(t) = e^{-\lambda t}$. Substituting this
$P_0(t)$ into (2.4) with n = 1, we get an ordinary differential equation
for $P_1(t)$. Since $P_1(0) = 0$, we find easily that $P_1(t) = \lambda t e^{-\lambda t}$, in
agreement with the Poisson distribution (1.1). Proceeding in the same
way, we find successively all terms of (1.1).
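
The successive integration can be made explicit with an integrating factor. The following is only a sketch: it assumes nothing beyond (2.4), (2.6) and the initial conditions $P_0(0) = 1$, $P_n(0) = 0$ for $n \ge 1$, and the auxiliary functions $Q_n$ are introduced here for illustration and do not appear elsewhere in the text.

```latex
% Put Q_n(t) = e^{\lambda t} P_n(t). Then (2.6) and (2.4) become
Q_0'(t) = 0, \qquad Q_n'(t) = \lambda\, Q_{n-1}(t) \quad (n \ge 1),
% with Q_0(0) = 1 and Q_n(0) = 0 for n \ge 1. Integrating one equation after another:
Q_0(t) = 1, \quad Q_1(t) = \lambda t, \quad Q_2(t) = \frac{(\lambda t)^2}{2!}, \quad \dots, \quad
Q_n(t) = \frac{(\lambda t)^n}{n!},
% and therefore
P_n(t) = e^{-\lambda t}\, \frac{(\lambda t)^n}{n!}.
```

Each step uses only the preceding one, which is why the terms of (1.1) are obtained one after another.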

The salient feature of this new derivation of the Poisson distribution 
is that it starts directly from plausible physical assumptions. The 
Poisson distribution no longer appears as an approximation to the 
binomial distribution or as a limiting distribution, but stands in its
own right (or, one might say, as the expression of a physical law).
The main advantage of the new derivation is that it lends itself to 
many generalizations. 6 

3. The Pure Birth Process 

In the Poisson process the probability of a change during (t, t + h)
is independent of the number of changes during (0, t). The simplest
generalization consists of dropping this assumption. We then assume
that, if n changes occur during (0, t), the probability of a new change
during (t, t + h) equals $\lambda_n h$ plus terms of smaller order of magnitude
than h; instead of a single constant $\lambda$ characterizing the process, we
have a sequence $\lambda_0, \lambda_1, \lambda_2, \ldots$.

It is convenient to introduce a more flexible terminology. Instead 
of saying that n changes occur during (0, t), we shall say that the system 
is in state $E_n$. A new change then becomes a transition $E_n \to E_{n+1}$.
In a pure birth process transitions from $E_n$ are possible only to $E_{n+1}$.
Such a process is characterized by the following 

Postulates. If at time t the system is in state $E_n$ (n = 0, 1, 2, ...),
then the probability that during (t, t + h) a transition to $E_{n+1}$ occurs equals
$\lambda_n h + o(h)$; the probability of several changes is of smaller order of
magnitude than h.

The salient feature of this assumption is that the time which the 
system spends in any particular state plays no role: there are sudden 
changes of state but no aging as long as the system remains within a 
single state. 

Again let $P_n(t)$ be the probability that at time t the system is in
state $E_n$. The functions $P_n(t)$ satisfy a system of differential equations
which can be derived by the argument of the preceding section, with
the only change that (2.2) is replaced by

(3.1)    $P_n(t + h) = P_n(t)(1 - \lambda_n h) + P_{n-1}(t)\lambda_{n-1} h + o(h).$

In this way we get the basic system of differential equations

(3.2)    $P_n'(t) = -\lambda_n P_n(t) + \lambda_{n-1} P_{n-1}(t)$    $(n \ge 1)$,
         $P_0'(t) = -\lambda_0 P_0(t).$

We can calculate $P_0(t)$ first and then, by recursion, all $P_n(t)$. If the
state of the system represents the number of changes during (0, t),

6 The processes of this chapter and their relation to diffusion processes are treated
in a lecture (of January 1946) by W. Feller, On the Theory of Stochastic Processes,
with Particular Reference to Applications, Proceedings of the Berkeley Symposium
on Mathematical Statistics and Probability, 1949, pp. 403-432. There the reader
will also find further references. 





then the initial state is $E_0$, so that $P_0(0) = 1$ and hence $P_0(t) = e^{-\lambda_0 t}$.
However, it is not necessary that the system start from state $E_0$ [see
example (3.b)]. If at time zero the system is in $E_i$, then we have

(3.3)    $P_i(0) = 1$,    $P_n(0) = 0$ for $n \ne i$.

These initial conditions uniquely determine the solution $\{P_n(t)\}$ of
(3.2). [In particular, $P_0(t) = P_1(t) = \cdots = P_{i-1}(t) = 0$.] Explicit
formulas for $P_n(t)$ have been derived independently by many
authors, but are of no interest to us. It is easily verified that for
arbitrarily prescribed $\lambda_n$ the system $\{P_n(t)\}$ has all required properties,
except that under certain conditions $\sum P_n(t) < 1$. This phenomenon
will be discussed in section 4.

Examples. (a) Radioactive Transmutations. A radioactive atom,
say uranium, may by emission of particles or $\gamma$-rays change to an atom
of a different kind. Each kind represents a possible state of the system,
and as the process continues, we get a succession of transitions
$E_0 \to E_1 \to E_2 \to \cdots \to E_m$. According to accepted physical theories,
the probability of a transition $E_n \to E_{n+1}$ remains unchanged as long
as the atom is in state $E_n$, and this hypothesis is expressed by our starting
supposition. The differential equations (3.2) therefore describe the
process (a fact well known to physicists). If $E_m$ is the terminal state
from which no further transitions are possible, then $\lambda_m = 0$ and the
system (3.2) terminates with n = m. [For n > m one gets automatically
$P_n(t) = 0$.]

(b) The Yule Process. Consider a population of members which can
(by splitting or otherwise) give birth to new members but cannot die.
Assume that during any short time interval of length h each member has
probability $\lambda h + o(h)$ to create a new one; the constant $\lambda$ determines
the rate of increase of the population. If there is no interaction among
the members and at time t the population size is n, then the probability
of an increase during (t, t + h) is $n\lambda h + o(h)$. The probability $P_n(t)$
that the population numbers exactly n elements therefore satisfies (3.2)
with $\lambda_n = n\lambda$, that is,

(3.4)    $P_n'(t) = -n\lambda P_n(t) + (n - 1)\lambda P_{n-1}(t)$    $(n \ge 1)$.

If i is the population size at time t = 0, then the initial conditions (3.3)
apply. The solution is easily found to be

(3.5)    $P_n(t) = \dbinom{n-1}{n-i} e^{-i\lambda t}(1 - e^{-\lambda t})^{n-i}$

for $n \ge i$, while, of course, $P_n(t) = 0$ for n < i.
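
Formula (3.5) can be checked numerically. The sketch below, with arbitrarily chosen illustrative values of i, $\lambda$ and t (they are not taken from the text), evaluates (3.5), confirms that the probabilities add to 1, compares their mean with $i e^{\lambda t}$, and tests equation (3.4) by a central-difference estimate of the derivative. The function names are hypothetical helpers introduced only for this check.

```python
import math

def yule_pmf(n, i, lam, t):
    """P_n(t) from (3.5) for a Yule process started with i members."""
    if n < i:
        return 0.0
    decay = math.exp(-lam * t)
    return math.comb(n - 1, n - i) * decay ** i * (1.0 - decay) ** (n - i)

def ode_residual(n, i, lam, t, eps=1e-5):
    """Central-difference estimate of P_n'(t) minus the right side of (3.4)."""
    lhs = (yule_pmf(n, i, lam, t + eps) - yule_pmf(n, i, lam, t - eps)) / (2 * eps)
    rhs = -n * lam * yule_pmf(n, i, lam, t) + (n - 1) * lam * yule_pmf(n - 1, i, lam, t)
    return lhs - rhs

i, lam, t = 3, 0.7, 1.2          # illustrative values, not from the text
probs = [yule_pmf(n, i, lam, t) for n in range(400)]   # the tail beyond 400 is negligible here
total = sum(probs)
mean = sum(n * p for n, p in enumerate(probs))
worst_residual = max(abs(ode_residual(n, i, lam, t)) for n in range(1, 30))
print(total, mean, i * math.exp(lam * t), worst_residual)
```

For these moderate parameter values the total should be 1 up to rounding and the residual should be of the order of the discretization error of the difference quotient.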





This type of process was first studied by Yule 7 in connection with 
the mathematical theory of evolution. The population consists of the 
species within a genus, and the creation of a new element is due to 
mutations. The assumption that each species has the same probability 
of throwing out a new species neglects the difference in species sizes. 
Since we have also neglected the possibility of a species dying out, 
(3.5) can be expected to give only a crude approximation. Furry 8 used 
the same model to describe a process connected with cosmic rays, but 
again the approximation is rather crude. The differential equations 
(3.4) apply strictly to a population of particles which can split into 
exact replicas of themselves, provided, of course, that there is no
interaction among particles.

* 4. Divergent Birth Processes 

The solution $\{P_n(t)\}$ of the infinite system of differential equations
(3.2) subject to initial conditions (3.3) can be calculated inductively,
starting from $P_i(t) = e^{-\lambda_i t}$. The distribution $\{P_n(t)\}$ is therefore
uniquely determined. From the familiar formulas for solving linear
differential equations it follows also that $P_n(t) \ge 0$. The only question
left open is whether $\{P_n(t)\}$ is an honest probability distribution, that
is, whether or not

(4.1)    $\sum P_n(t) = 1$

for all t. We shall see that this is not always so: if the coefficients $\lambda_n$
increase sufficiently fast, then it may happen that

(4.2)    $\sum P_n(t) < 1.$

At first sight this possibility appears surprising and, perhaps, disturbing,
but it finds a ready explanation. The left side in (4.2) may be
interpreted as the probability that during time t only a finite number 
of changes takes place. Accordingly, the difference between the two 
sides in (4.2) accounts for the possibility of infinitely many changes, 
or a sort of explosion. For a better understanding of this phenomenon 

* Starred sections treat special topics and may be omitted at first reading. 

7 G. Udny Yule, A Mathematical Theory of Evolution, Based on the Conclusions
of Dr. J. C. Willis, F.R.S., Philosophical Transactions of the Royal Society,
London, Series B, vol. 213 (1924), pp. 21-87. Yule does not introduce the differential
equations (3.4) but derives $P_n(t)$ by a limiting process similar to the one
which we used in Chapter 6, section 5, for the Poisson process.

8 Furry, On Fluctuation Phenomena in the Passage of High-energy Electrons
through Lead, Physical Review, vol. 52 (1937), p. 569.





let us compare our probabilistic model of growth with the familiar 
deterministic approach. 

The quantity $\lambda_n$ in (3.2) could be called the average rate of growth
at a time when the population size is n. For example, in the special
case (3.4) we have $\lambda_n = n\lambda$, so that the average rate of growth is proportional
to the actual population size. If growth is not subject to
chance fluctuations and has a rate of increase proportional to the instantaneous
population size, then the population size x(t) varies in accordance
with the deterministic differential equation


(4.3)    $\dfrac{dx(t)}{dt} = \lambda x(t).$


It follows that at time t the population size is 

(4.4)    $x(t) = i e^{\lambda t},$

where i = x(0) is the initial population size. The connection between
(3.4) and (4.3) is not purely formal. It is readily seen that (4.4) actually
gives the mean of the distribution (3.5), so that (4.3) describes the
expected population size, while (3.4) takes account of chance fluctuations.

Let us now consider a deterministic growth process where the rate 
of growth increases faster than the population size. If the rate of 
growth is proportional to $x^2(t)$, we get the differential equation

(4.5)    $\dfrac{dx(t)}{dt} = \lambda x^2(t),$

whose solution is

(4.6)    $x(t) = \dfrac{i}{1 - \lambda i t}.$


Note that x(t) increases over all bounds as $t \to 1/(\lambda i)$. In other words,
the assumption that the rate of growth increases as the square of the
population size implies an infinite growth within a finite time interval.
Similarly, if in (3.2) the $\lambda_n$ increase too fast, there is a finite probability
that infinitely many changes take place in a finite time interval. A
precise answer as to the conditions when such a divergent growth
occurs is given by the

Theorem. In order that (4.1) may hold for all t it is necessary and
sufficient that the series

(4.7)    $\sum \lambda_n^{-1}$

diverge.





Proof. Letting

(4.8)    $S_k(t) = P_0(t) + \cdots + P_k(t),$

we get from (3.2)

(4.9)    $S_k'(t) = -\lambda_k P_k(t)$

and hence for $k \ge i$

(4.10)    $1 - S_k(t) = \lambda_k \int_0^t P_k(\tau)\, d\tau.$

Since all terms in (4.8) are non-negative, the sequence $S_k(t)$ (for fixed t)
can only increase with k, and therefore the right side in (4.10) decreases
monotonically with k. Call its limit $\mu(t)$. Then for $k \ge i$

(4.11)    $\lambda_k \int_0^t P_k(\tau)\, d\tau \ge \mu(t)$

and hence

(4.12)    $\int_0^t S_n(\tau)\, d\tau \ge \mu(t)\left(\dfrac{1}{\lambda_i} + \dfrac{1}{\lambda_{i+1}} + \cdots + \dfrac{1}{\lambda_n}\right).$

Because of (4.10) we have $S_n(t) \le 1$, so that the left side in (4.12) is
at most t. If the series (4.7) diverges, the second factor on the right
in (4.12) tends to infinity, and the inequality can hold only if $\mu(t) = 0$
for all t. In this case the right side in (4.10) tends to 0 as $k \to \infty$, and
therefore $S_n(t) \to 1$, so that (4.1) holds. Conversely, the left side of
(4.12) is less than $\lambda_0^{-1} + \lambda_1^{-1} + \cdots + \lambda_n^{-1}$. If (4.7) converges,
this expression is bounded and hence it is impossible that $S_n(t) \to 1$
for all t.
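
The dichotomy of the theorem can be observed numerically. The following sketch integrates the system (3.2) for a process started in $E_1$ and returns the partial sum $S_N(t)$ of (4.8). The choice of rates ($\lambda_n = n$, for which (4.7) diverges, and $\lambda_n = n^2$, for which it converges), the truncation level, the step count and the use of classical Runge-Kutta steps are all assumptions made for illustration; nothing of this kind appears in the text.

```python
def partial_sum(rate, n_max, t_end, steps=2000):
    """S_N(t_end) = P_1 + ... + P_N for a pure birth process started in E_1.

    rate(n) plays the role of lambda_n. Since P_n depends only on P_{n-1},
    cutting the system (3.2) off at n_max leaves P_1, ..., P_{n_max} unchanged.
    """
    lam = [rate(n) for n in range(1, n_max + 1)]   # lam[k] is lambda_{k+1}

    def deriv(p):
        d = [-lam[0] * p[0]]
        for k in range(1, n_max):
            d.append(-lam[k] * p[k] + lam[k - 1] * p[k - 1])
        return d

    def shifted(p, k, c):
        return [x + c * y for x, y in zip(p, k)]

    p = [1.0] + [0.0] * (n_max - 1)                # P_1(0) = 1
    h = t_end / steps
    for _ in range(steps):                         # classical Runge-Kutta step
        k1 = deriv(p)
        k2 = deriv(shifted(p, k1, h / 2))
        k3 = deriv(shifted(p, k2, h / 2))
        k4 = deriv(shifted(p, k3, h))
        p = [x + h * (a + 2 * b + 2 * c + d) / 6
             for x, a, b, c, d in zip(p, k1, k2, k3, k4)]
    return sum(p)

linear = partial_sum(lambda n: n, 40, 1.0)          # sum of 1/n diverges
quadratic = partial_sum(lambda n: n * n, 40, 1.0)   # sum of 1/n^2 converges
print(linear, quadratic)
```

With linear rates the partial sum is already very close to 1, whereas with quadratic rates it stays visibly below 1 however large N is taken; the missing mass corresponds to the possibility of infinitely many changes before time t.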

5. The Birth and Death Process

The pure birth process of section 3 provides a satisfactory description 
of radioactive transmutations, but it can obviously not serve as a 
realistic model for changes in the size of populations whose members 
can die (or drop out in any way). This suggests generalizing the model 
by permitting transitions from the state $E_n$ not only to the next higher
state $E_{n+1}$, but also to the next lower state $E_{n-1}$. (Still more general
processes will be defined in section 9.) Accordingly we now start 
from the following 

Postulates. The system changes only through transitions from states
to their next neighbors (from $E_n$ to $E_{n+1}$ or $E_{n-1}$ if $n \ge 1$, but from $E_0$
to $E_1$ only). If at any time t the system is in state $E_n$, then the probability
that during (t, t + h) the transition $E_n \to E_{n+1}$ occurs equals $\lambda_n h + o(h)$,
and the probability of $E_n \to E_{n-1}$ (if $n \ge 1$) equals $\mu_n h + o(h)$. The
probability that during (t, t + h) more than one change occurs is of smaller
order of magnitude than h.

It is easy to adapt the method of section 2 to derive differential
equations for the probabilities $P_n(t)$ of finding the system at time t in
state $E_n$. To calculate $P_n(t + h)$ we note that at time t + h the system
can be in state $E_n$ only if one of the following conditions is satisfied:
(1) at time t the system is in $E_n$ and during (t, t + h) no change occurs;
(2) at time t the system is in $E_{n-1}$ and a transition to $E_n$ occurs; (3)
at time t the system is in $E_{n+1}$ and a transition to $E_n$ occurs; (4) during
(t, t + h) two or more transitions occur. By assumption, the probability
of the last situation tends to zero faster than h. The first three
contingencies are mutually exclusive, so that their probabilities add.
We get therefore

(5.1)    $P_n(t + h) = P_n(t)(1 - \lambda_n h - \mu_n h) + \lambda_{n-1} h P_{n-1}(t) + \mu_{n+1} h P_{n+1}(t) + o(h).$

Transposing the term $P_n(t)$ and dividing the equation by h, we get
on the left the difference ratio of $P_n(t)$. Letting $h \to 0$, we get

(5.2)    $P_n'(t) = -(\lambda_n + \mu_n)P_n(t) + \lambda_{n-1}P_{n-1}(t) + \mu_{n+1}P_{n+1}(t).$

This equation holds for $n \ge 1$. For n = 0 in the same way

(5.3)    $P_0'(t) = -\lambda_0 P_0(t) + \mu_1 P_1(t).$

If at time zero the system is in state $E_i$, then again the initial condition

(5.4)    $P_i(0) = 1$,    $P_n(0) = 0$ for $n \ne i$

holds.

In (5.2)-(5.4) we have the fundamental equation of the birth and 
death process. In fact, the coefficients $\lambda_n$ and $\mu_n$ can be arbitrarily
prescribed; the differential equations (5.2) and (5.3), together with the
initial conditions (5.4), then uniquely determine the corresponding system
of probabilities $P_n(t)$. This assertion is by no means obvious. In the
case of a pure birth process we had also an infinite system of differential
equations; however, the system (3.2) had the form of recurrence
relations where $P_n(t)$ can be calculated from $P_{n-1}(t)$, and $P_0(t)$ is
determined by the first equation. The new system (5.2) is not of this
form, and all $P_n(t)$ must be calculated simultaneously. A complete
proof of the existence and uniqueness of a system of solutions $\{P_n(t)\}$
is lengthy and will be omitted. 9 It turns out that we have always

9 This assertion is a special case of the more general theorem of section 9. 



$P_n(t) \ge 0$ and

(5.5)    $\sum P_n(t) \le 1.$

In cases of practical interest the equality sign holds. 

If $\lambda_0 = 0$, then the transition $E_0 \to E_1$ is impossible. In the terminology
of Markov chains $E_0$ is an absorbing state from which no exit is
possible; once the system is in $E_0$ it stays there. From (5.3) it follows
that in this case $P_0'(t) \ge 0$, so that $P_0(t)$ increases monotonically. The
limit $P_0(\infty)$ is the probability of ultimate absorption.

More generally, it can be shown that the limits

(5.6)    $\lim_{t \to \infty} P_n(t) = p_n$

exist and are independent of the initial conditions (5.4); they satisfy the
system of linear equations obtained from (5.2)-(5.3) on putting 10
$P_n'(t) = 0$.

This statement can be proved either from the explicit formulas 11 
for the P n (t) or from general ergodic theories. Intuitively the theorem 
becomes almost obvious by a comparison of our process with a simple 
Markov chain with transition probabilities 

(5.7)    $p_{n,n+1} = \dfrac{\lambda_n}{\lambda_n + \mu_n}$,    $p_{n,n-1} = \dfrac{\mu_n}{\lambda_n + \mu_n}$.

In this chain the only direct transitions are $E_n \to E_{n+1}$ and $E_n \to E_{n-1}$,
and they have the same conditional probabilities as in our process; the
difference between the chain and our process lies in the fact that,
with the latter, changes can occur at arbitrary times, so that the
number of transitions during time t is a random variable. However,
for large t this number is certain to be large, and hence it is plausible
that for $t \to \infty$ the probabilities $P_n(t)$ behave as the corresponding
probabilities of the simple chain.

The principal field of applications of the birth and death process is 
to problems of waiting times, trunking problems, etc. Such applications
will be discussed in sections 6 and 7.

10 This is the so-called “steady-state condition.” It must be understood that
despite the suggestive name no steady state ever is reached except when $E_0$ is an
absorbing state. In general the chance fluctuations continue unabated, and the existence
of the limits (5.6) only indicates that in the long run the influence of the
initial state disappears. The steady state is one of so-called statistical equilibrium.

11 W. Feller, On the Integrodifferential Equations of Completely Discontinuous 
Markov Processes, Transactions of the American Mathematical Society , vol. 48 
(1940), pp. 488-515. 





Example. Linear Growth. Suppose that a population consists of
elements which can split or die. During any short time interval of
length h the probability for any living element to split into two is
$\lambda h + o(h)$, while the corresponding probability of dying is $\mu h + o(h)$.
Here $\lambda$ and $\mu$ are two constants characteristic of the population. If
there is no interaction among the elements, then we are led to a birth
and death process with $\lambda_n = n\lambda$, $\mu_n = n\mu$. The basic differential
equations take on the form

(5.8)    $P_0'(t) = \mu P_1(t),$
         $P_n'(t) = -(\lambda + \mu)nP_n(t) + \lambda(n - 1)P_{n-1}(t) + \mu(n + 1)P_{n+1}(t)$    $(n = 1, 2, \ldots)$.

Explicit solutions can be found 12 (cf. problems 7-9), but we shall
not discuss this aspect. It can be shown that the limits (5.6) exist.
They obviously satisfy (5.8) with $P_n'(t) = 0$. From the first equation
we find $p_1 = 0$, and then we see by induction from the second equation
that $p_n = 0$ for all $n \ge 1$. If $p_0 = 1$, we may say that the probability
of ultimate extinction is 1. If $p_0 < 1$, then the relations $p_1 = p_2 = \cdots = 0$
imply that with probability $1 - p_0$ the population increases
over all bounds; ultimately the population must either die out or
increase indefinitely. To find the probability $p_0$ of extinction we compare
the process to the related Markov chain. In our case the transition
probabilities (5.7) are independent of n, and we have therefore an
ordinary random walk in which the steps to the right and left have
probabilities $p = \lambda/(\lambda + \mu)$ and $q = \mu/(\lambda + \mu)$, respectively. The
state $E_0$ (or x = 0) is an absorbing barrier. We know from the classical
ruin problem (Chapter 14, section 2) that the probability of extinction
is 1 if $p \le q$ and $(q/p)^r$ if q < p and r is the initial state. We conclude
that in our process the probability $p_0 = \lim P_0(t)$ of ultimate extinction
is 1 if $\lambda \le \mu$, and $(\mu/\lambda)^r$ if $\lambda > \mu$. (This is easily verified from the
explicit solution; cf. problem 8.)

As in many similar cases, the explicit solution of (5.8) is rather 
complicated, and it is desirable to calculate the mean and the variance 

12 A systematic way consists in deriving a partial differential equation for the
generating function $\sum P_n(t)s^n$. A more general process [where the coefficients
$\lambda$ and $\mu$ in (5.8) are permitted to depend on time] is discussed in detail in David G.
Kendall, The Generalized “Birth and Death” Process, Annals of Mathematical
Statistics, vol. 19 (1948), pp. 1-15. Cf. also a recent paper by the same author,
Stochastic Processes and Population Growth (to appear in 1950 in the Journal of
the Royal Statistical Society), where the theory is generalized so as to take account
of the age distribution in biological populations.



of the distribution $\{P_n(t)\}$. Write for the mean

(5.9)    $M(t) = \sum_{n=1}^{\infty} n P_n(t).$

We shall omit a formal proof that M(t) is finite and that the following 
formal operations are justified (again both points follow readily from 
the solution given in problem 8). Multiplying the second equation 
in (5.8) by n and adding over n = 1, 2, ..., we find that the terms
containing $n^2$ cancel, and we get

(5.10)    $M'(t) = \lambda \sum (n - 1)P_{n-1}(t) - \mu \sum (n + 1)P_{n+1}(t) = (\lambda - \mu)M(t).$

This is a differential equation for M(t). At time t = 0 the population
size is i, and hence M(0) = i. Therefore

(5.11)    $M(t) = i e^{(\lambda - \mu)t}.$

We see that the mean tends to 0 or infinity, according as $\lambda < \mu$ or
$\lambda > \mu$. The variance of $\{P_n(t)\}$ can be calculated in a similar way
(cf. problem 10).
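
Formula (5.11) can be compared with a direct simulation of the linear growth process. The sketch below is not part of the text and rests on a stated assumption: the time spent in state $E_n$ is taken to be exponentially distributed with rate $(\lambda + \mu)n$, and the direction of each change is chosen with the probabilities (5.7). The parameter values, the seed and the function name are arbitrary illustrations.

```python
import math
import random

def simulate_linear_growth(i, lam, mu, t_end, rng):
    """One sample of the population size at time t_end, starting from i members."""
    n, t = i, 0.0
    while n > 0:                      # E_0 is absorbing
        # Assumption: the holding time in E_n is exponential with rate (lam + mu) * n.
        t += rng.expovariate((lam + mu) * n)
        if t > t_end:
            break
        # The next change is a birth with probability lam / (lam + mu), cf. (5.7).
        n += 1 if rng.random() < lam / (lam + mu) else -1
    return n

rng = random.Random(12345)            # fixed seed so that the run is repeatable
i, lam, mu, t_end = 5, 1.0, 1.5, 2.0  # illustrative values, not from the text
runs = 20000
sample_mean = sum(simulate_linear_growth(i, lam, mu, t_end, rng) for _ in range(runs)) / runs
print(sample_mean, i * math.exp((lam - mu) * t_end))
```

The two printed numbers should agree up to sampling error, which for this number of runs is of the order of a few hundredths.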

6. Exponential Holding Times 

The principal field of applications of the pure birth and death 
process is connected with trunking in telephone engineering and various 
types of waiting lines for telephones, counters, or machines. This 
type of problem can be treated with various degrees of mathematical 
sophistication. The method of the birth and death process offers the 
easiest approach, but this model is based on a mathematical simplification
known as the assumption of exponential holding times. We begin
with a discussion of this basic assumption. 

For concreteness of language let us consider a telephone conversation,
and let us assume that its length is necessarily an integral number
of seconds. We treat the length of the conversation as a random
variable X and assume its probability distribution $p_n = \Pr\{X = n\}$
known. The telephone line then represents a physical system with
two possible states, “busy” ($E_0$) and “free” ($E_1$). If at an arbitrary
moment t the line is busy, then the probability of a change in state 
during the next second depends on how long the conversation has been 
going on. In other words, the past has an influence on the future, 
and our process is therefore not a Markov process (cf. Chapter 15, 
section 10). This circumstance is the source of most difficulties in
more complicated problems. However, there exists a simple exceptional
case.

Imagine that the decision as to whether or not the conversation is 
to be continued is made each second at random by means of a skew 
coin. In other words, a sequence of Bernoulli trials with probability p 
of success is performed at a rate of one per second and continued until 
the first success. The conversation ends when this first success occurs. 
In this case the total length of the conversation, the “holding time,” 
has the geometric distribution $p_n = q^{n-1}p$. If at any time t the line is
busy, the probability that it will remain busy for more than one second
is q, and the probability of the transition $E_0 \to E_1$ at the next step is p.
These probabilities are now independent of how long the line was busy. 

This situation has been discussed at length in Chapter 11, section 4. 
We found there that in passing to the limit we get a process with a 
continuous time parameter, and the geometric distribution approaches 
an exponential function (whence the name “exponential holding time”).
In the limit we have then the following situation. The probability
that a conversation starting at time 0 extends beyond time t is $e^{-\mu t}$;
if at any time t the line is busy, the probability of a change in state
during (t, t + h) is $\mu h$ plus terms which tend to zero faster than h.
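
The passage to the limit can be illustrated numerically. In the sketch below the coin is tossed a given number of times per second, with success probability $\mu$ divided by that number, so that the expected number of successes per second stays at $\mu$. This particular scaling and the function name are assumptions made for the illustration.

```python
import math

def survival_discrete(mu, t, trials_per_second):
    """Probability that the conversation is still going on at time t when a coin
    with success probability mu / trials_per_second is tossed
    trials_per_second times per second."""
    p = mu / trials_per_second
    tosses = round(t * trials_per_second)
    return (1.0 - p) ** tosses

mu, t = 0.5, 2.0   # illustrative values, not from the text
for rate in (1, 10, 1000, 10**6):
    print(rate, survival_discrete(mu, t, rate), math.exp(-mu * t))
```

As the number of trials per second grows, the geometric tail probabilities approach the exponential value in the last column.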

The method of the birth and death process is applicable only if the 
transition probabilities in question do not depend on the past; for 
trunking and waiting line problems this means that all holding times 
must be exponential. From a practical point of view this assumption 
may at first sight appear rather artificial, but experience shows that it 
reasonably describes actual phenomena. In particular, many measurements
have shown that telephone conversations within a city 13 follow
the exponential law to a surprising degree of accuracy.

These remarks apply to holding times (e.g., length of telephone 
conversations, duration of machine repairs, etc.). We must also 
characterize the so-called incoming traffic (arriving calls, machine 
breakdowns, etc.). We shall assume that during any time interval of 
length h the probability of an incoming call is $\lambda h$ plus negligible terms,
and that the probability of more than one call is in the limit negligible.
According to the results of section 2, this means that the number of
incoming calls has a Poisson distribution with mean $\lambda t$. We shall
describe this situation by saying that the incoming traffic is of the Poisson
type with intensity $\lambda$.

13 For conversations between cities, companies usually charge by intervals of 3
minutes, and the holding times are therefore likely to be multiples of 3 minutes. 
This is a systematic deviation from the exponential law, and our theory does not 
apply. 





7. Waiting Line and Servicing Problems 

(a) The Simplest Trunking Problem. 14 Suppose that infinitely
many trunks or channels are available, and that the probability of a
conversation ending during the interval (t, t + h) is $\mu h$ plus terms
which are negligible as $h \to 0$ (exponential holding time). The incoming
calls constitute a traffic of the Poisson type with parameter $\lambda$. The
system is in state $E_n$ if n lines are busy.

It is, of course, assumed that the durations of the conversations are
mutually independent. If n lines are busy, the probability that one
of them will be freed within time h is then $n\mu h + o(h)$. The probability
that within this time two or more conversations terminate is
obviously of the order of magnitude $h^2$ and therefore negligible. The
probability of a new call arriving is $\lambda h + o(h)$. The probability of a
combination of several calls, or of a call arriving and a conversation
ending, is again of order of magnitude $h^2$. Thus, in the notation of
section 5,

(7.1)    $\lambda_n = \lambda, \qquad \mu_n = n\mu.$

The basic differential equations (5.2)-(5.3) take the form 

          $P_0'(t) = -\lambda P_0(t) + \mu P_1(t)$

(7.2)    $P_n'(t) = -(\lambda + n\mu) P_n(t) + \lambda P_{n-1}(t) + (n + 1)\mu P_{n+1}(t) \qquad (n \ge 1).$

Explicit solutions can be obtained by deriving a partial differential 
equation for the generating function (cf. problem 11). We shall only 
determine the limits (5.6). They satisfy the equations 


(7.3)    $\lambda p_0 = \mu p_1$

          $(\lambda + n\mu) p_n = \lambda p_{n-1} + (n + 1)\mu p_{n+1}.$

One finds by iteration that $p_n = p_0 (\lambda/\mu)^n / n!$, and hence

(7.4)    $p_n = e^{-\lambda/\mu} \dfrac{(\lambda/\mu)^n}{n!}.$


14 C. Palm, Intensitätsschwankungen im Fernsprechverkehr, Ericsson Technics
(Stockholm), no. 44 (1943), pp. 1-189, in particular p. 57. Palm studies approximations
in the case where $\lambda$ and $\mu$ are periodic functions of t. Problems of this
type were studied by Erlang (1878-1929), whose work was a predecessor of the
general theory of stochastic processes. See the recent book by E. Brockmeyer,
H. L. Halstrøm, and Arne Jensen, The Life and Works of A. K. Erlang, Transactions
of the Danish Academy Technical Sciences, No. 2, Copenhagen, 1948. Independently,
valuable pioneer work has been done by T. C. Fry; his book, quoted on
p. 107, did much for the development of engineering applications of probability.





Thus, the limiting distribution is a Poisson distribution with parameter
$\lambda/\mu$. It is independent of the initial state.

It is easy to find the mean $M(t) = \sum n P_n(t)$. Multiplying the nth equation of
(7.2) by n and adding, we get [taking into account that the $P_n(t)$ add to unity]

(7.5)    $M'(t) = \lambda - \mu M(t).$

If the initial state is $E_i$, then $M(0) = i$, and

(7.6)    $M(t) = \dfrac{\lambda}{\mu}\left(1 - e^{-\mu t}\right) + i e^{-\mu t}.$

As $t \to \infty$, we see that M(t) approaches the mean of the Poisson distribution found
above. Incidentally, the reader may verify that in the special case i = 0 the
$P_n(t)$ are given exactly by a Poisson distribution with mean M(t).
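
The last two statements can be checked numerically. The following is only a sketch under stated assumptions: it integrates the system (7.2) by Euler's method after cutting it off at a hypothetical level `n_max` (the cut-off, the step count, and the values of λ, μ, t are our choices, not part of the text), and compares the result with the mean (7.6) and with a Poisson distribution of mean M(t) for i = 0.

```python
import math

def integrate_trunking(lam, mu, i, t_end, n_max=40, steps=4000):
    """Euler integration of (7.2), truncated at n_max, starting in state E_i."""
    p = [0.0] * (n_max + 1)
    p[i] = 1.0
    h = t_end / steps
    for _ in range(steps):
        new = p[:]
        for n in range(n_max + 1):
            # no arrivals are allowed out of the artificial last state
            out_rate = (lam if n < n_max else 0.0) + n * mu
            inflow = 0.0
            if n > 0:
                inflow += lam * p[n - 1]
            if n < n_max:
                inflow += (n + 1) * mu * p[n + 1]
            new[n] = p[n] + h * (inflow - out_rate * p[n])
        p = new
    return p

def mean_formula(lam, mu, i, t):
    """Closed form (7.6) for the mean number of busy lines."""
    return (lam / mu) * (1 - math.exp(-mu * t)) + i * math.exp(-mu * t)

lam, mu, t = 2.0, 1.0, 1.5          # illustrative values only
p = integrate_trunking(lam, mu, i=0, t_end=t)
mean_numeric = sum(n * pn for n, pn in enumerate(p))
m_t = mean_formula(lam, mu, 0, t)
poisson = [math.exp(-m_t) * m_t**n / math.factorial(n) for n in range(len(p))]
max_gap = max(abs(a - b) for a, b in zip(p, poisson))
print(mean_numeric, m_t, max_gap)
```

Euler's method is used only because it needs no library; the agreement is therefore approximate and improves as `steps` grows.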


(b) Waiting Lines for a Finite Number of Channels. 15 We now treat
the last example in a more realistic way. The assumptions are the
same, except that the number a of trunklines or channels is finite. If all
channels are busy, each new call joins a waiting line and waits until a
channel is freed. This means that all trunklines have a common waiting
line.

The word “trunk” may be replaced by counter at a post office and
“conversation” by service. We are actually treating the general waiting
line problem for the case where a person has to wait only if all a channels
are busy.

We say that the system is in state $E_n$ if n is the total number of persons
who are either being served or are in the waiting line. A waiting line
exists only when the system is in a state $E_n$ with n > a, and then
there are $n - a$ people in the waiting line.

As long as at least one channel is free, we are in exactly the same
situation as in the preceding example. However, if the system is in a
state $E_n$ with $n \ge a$, then only a conversations are going on, and we
have therefore $\mu_n = a\mu$ for $n \ge a$. The basic system of differential
equations is therefore given by (7.2) for n < a, but for $n \ge a$ by

(7.7)    $P_n'(t) = -(\lambda + a\mu) P_n(t) + \lambda P_{n-1}(t) + a\mu P_{n+1}(t).$

We again investigate the limits (5.6) (which can be shown to exist).
They must satisfy the equations (7.3) if n < a and

(7.8)    $(\lambda + a\mu) p_n = \lambda p_{n-1} + a\mu p_{n+1}$

if $n \ge a$. By recursion we find again that for $n \le a$

(7.9)    $p_n = p_0 \dfrac{(\lambda/\mu)^n}{n!},$

while for $n \ge a$

(7.10)    $p_n = \dfrac{(\lambda/\mu)^n}{a!\, a^{n-a}}\, p_0.$

15 A. Kolmogoroff, Sur le problème d'attente, Recueil Mathématique [Sbornik],
Vol. 38, 1931, pp. 101-106.

The series $\sum (p_n/p_0)$ converges only if

(7.11)    $\dfrac{\lambda}{\mu} < a.$

Hence, if (7.11) does not hold, a limiting distribution $\{p_k\}$ cannot
exist. In this case $p_n = 0$ for all n, which means that gradually the
waiting line grows over all bounds. On the other hand, if (7.11) holds,
then we can determine $p_0$ so that the sum of the expressions (7.9) and
(7.10) equals unity. From the explicit expressions for $P_n(t)$ (which
we have not derived, however), it can be shown that the $p_n$ thus obtained
really represent the limiting distribution of the $P_n(t)$. Table 1
gives a numerical illustration for a = 3, $\lambda/\mu = 2$.


TABLE 1

Limiting Probabilities in the Case of a = 3 Channels and $\lambda/\mu = 2$

n                  0       1       2       3       4       5       6       7
Lines busy         0       1       2       3       3       3       3       3
People waiting     0       0       0       0       1       2       3       4
p_n             0.1111  0.2222  0.2222  0.1481  0.0988  0.0658  0.0439  0.0293
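
The entries of Table 1 can be recomputed from (7.9) and (7.10). The sketch below is an illustration only; the function name `waiting_line_limits` and its arguments are our own, and $p_0$ is obtained by summing the geometric tail of (7.10) in closed form, which presupposes condition (7.11).

```python
import math

def waiting_line_limits(rho, a, n_max):
    """Return p_0, ..., p_{n_max} from (7.9)-(7.10); rho stands for lambda/mu."""
    if rho >= a:
        raise ValueError("condition (7.11) requires lambda/mu < a")
    # terms n = 0, ..., a-1 of the series sum(p_n / p_0)
    head = sum(rho**n / math.factorial(n) for n in range(a))
    # terms n >= a form a geometric series with ratio rho / a
    tail = rho**a / math.factorial(a) / (1 - rho / a)
    p0 = 1 / (head + tail)
    probs = []
    for n in range(n_max + 1):
        if n <= a:
            ratio = rho**n / math.factorial(n)                   # (7.9)
        else:
            ratio = rho**n / (math.factorial(a) * a**(n - a))    # (7.10)
        probs.append(p0 * ratio)
    return probs

table1 = waiting_line_limits(rho=2.0, a=3, n_max=7)
print([round(p, 4) for p in table1])
```

For a = 3 and λ/μ = 2 the series sums to 9, so that $p_0 = 1/9$, in agreement with the first entry of the table.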


(c) Servicing of Machines. 16 The results derived in this and the next
example are being successfully applied in Swedish industry. For
orientation we begin with the simplest case and generalize it in the
next example. The problem is as follows.

We consider automatic machines which normally require no human
care. However, at any time a machine may break down and call for
service. The time required for servicing the machine is again taken
as a random variable with an exponential distribution. In other
words, the machine is characterized by two constants $\lambda$ and $\mu$ with
the following properties. If at time t the machine is in working state,
the probability that it will call for service before time t + h is $\lambda h$ plus
terms which are negligible in the limit $h \to 0$. Conversely, if at time
t the machine is being serviced, the probability that the servicing time
terminates before t + h and the machine reverts to the working state
is $\mu h + o(h)$. For an efficient machine $\lambda$ should be relatively small and
$\mu$ relatively large. The ratio $\lambda/\mu$ is called the servicing factor.

16 Examples (c) and (d), including the numerical illustrations, are taken from an
article by C. Palm, The Distribution of Repairmen in Servicing Automatic Machines
(in Swedish), Industritidningen Norden, vol. 75 (1947), pp. 75-80, 90-94,
119-123. Palm gives tables and graphs for the most economical number of repairmen.

We suppose that m machines with the same parameters $\lambda$ and $\mu$ are
serviced by a single repairman. If a machine breaks down, it will be
serviced immediately unless the repairman is servicing another machine,
in which case a waiting line is formed. We say that the system
is in state $E_n$ if n machines are not working. For $1 \le n \le m$ this means
that one machine is being serviced and $n - 1$ are in the waiting line;
in the state $E_0$ all machines work and the repairman is idle. The m
machines are assumed to work independently.

A transition $E_n \to E_{n+1}$ is caused by a breakdown of one among the
$m - n$ working machines, while a transition $E_n \to E_{n-1}$ occurs if the
machine being serviced reverts to the working state. Hence we have a
birth and death process with coefficients


(7.12)    $\lambda_0 = m\lambda, \qquad \mu_0 = 0,$

           $\lambda_n = (m - n)\lambda, \qquad \mu_n = \mu$

$(1 \le n \le m)$, and the basic differential equations (5.2) and (5.3)
become $(1 \le n \le m - 1)$:

           $P_0'(t) = -m\lambda P_0(t) + \mu P_1(t),$

(7.13)    $P_n'(t) = -\{(m - n)\lambda + \mu\} P_n(t) + (m - n + 1)\lambda P_{n-1}(t) + \mu P_{n+1}(t),$

           $P_m'(t) = -\mu P_m(t) + \lambda P_{m-1}(t).$

This is a finite system of differential equations and can be solved
by ordinary methods. The limits (5.6) exist and satisfy the equations

           $m\lambda p_0 = \mu p_1,$

(7.14)    $\{(m - n)\lambda + \mu\} p_n = (m - n + 1)\lambda p_{n-1} + \mu p_{n+1},$

           $\mu p_m = \lambda p_{m-1}.$




It follows easily that the recursion formula

(7.15)    $(m - n)\lambda p_n = \mu p_{n+1}$

holds. From it we get

(7.16)    $p_n = (m)_n \left(\dfrac{\lambda}{\mu}\right)^n p_0,$

where $(m)_n = m(m - 1) \cdots (m - n + 1)$. The value of $p_0$ follows
from the condition $\sum p_n = 1$. Table 2 gives the values of $p_n$ for the
case of m = 6 machines and a servicing factor $\lambda/\mu = 0.1$. The table
also exhibits a simple way of calculating $p_n$.

The probability $p_0$ may be interpreted as the probability of the
repairman being idle (in the example of Table 2 he is likely to be idle
almost half the time). The expected number of machines in the waiting
line is

(7.17)    $w = \sum_{k=2}^{m} (k - 1) p_k.$

This quantity can be calculated by adding the relations (7.15) for
n = 0, 1, ..., m. Using the fact that the $p_n$ add to unity, we get

(7.17)    $m\lambda - \lambda w - \lambda(1 - p_0) = \mu(1 - p_0)$

or

(7.18)    $w = m - \dfrac{\lambda + \mu}{\lambda}(1 - p_0).$

In the example of Table 2 we have w = 6(0.0549). Thus 0.0549 is
the average contribution of a machine to the waiting line.


TABLE 2

Probabilities $p_n$ for the Case $\lambda/\mu = 0.1$, m = 6

        Machines in
  n     Waiting Line     p_n/p_0      p_n
  0          0           1            0.4845
  1          0           0.6           .2907
  2          1           0.3           .1454
  3          2           0.12          .0582
  4          3           0.036         .0175
  5          4           0.0072        .0035
  6          5           0.00072       .0003

$1/p_0 = \sum p_n/p_0 = 2.06392$
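
Table 2 and the value of w can be reproduced in a few lines. This is a minimal sketch, assuming only the recursion (7.15); the function name `single_repairman` is ours. It also compares the definition (7.17) of w with the closed form (7.18), using $(\lambda + \mu)/\lambda = 1 + \mu/\lambda$.

```python
def single_repairman(m, rho):
    """Ratios p_n/p_0, their sum 1/p_0, and the p_n for one repairman.

    rho stands for the servicing factor lambda/mu.
    """
    ratios = [1.0]
    for n in range(m):
        ratios.append(ratios[-1] * (m - n) * rho)   # recursion (7.15)
    total = sum(ratios)                              # equals 1/p_0
    probs = [x / total for x in ratios]
    return ratios, total, probs

m, rho = 6, 0.1
ratios, total, probs = single_repairman(m, rho)
p0 = probs[0]
w_direct = sum((k - 1) * probs[k] for k in range(2, m + 1))   # definition (7.17)
w_closed = m - (1 + 1 / rho) * (1 - p0)                       # closed form (7.18)
print(round(total, 5), round(p0, 4), round(w_direct / m, 4))
```

The two values of w agree up to rounding error, which is a convenient check on the derivation of (7.18).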





(d) Continuation: Several Repairmen. We shall not change the basic
assumptions of the preceding problem, except that the m machines are
now serviced by r repairmen (r < m). Thus for $n \le r$ the state $E_n$
means that $r - n$ repairmen are idle, n machines are being serviced,
and no machine is in the waiting line for repairs. For n > r the state
$E_n$ means that r machines are being serviced and $n - r$ machines are
in the waiting line. We can use the set-up of the preceding example
except that (7.12) is obviously to be replaced by

           $\lambda_0 = m\lambda, \qquad \mu_0 = 0,$

(7.19)    $\lambda_n = (m - n)\lambda, \qquad \mu_n = n\mu \qquad (1 \le n \le r),$

           $\lambda_n = (m - n)\lambda, \qquad \mu_n = r\mu \qquad (r \le n \le m).$

We shall not write down the basic system of differential equations, but
only the equations for the limiting probabilities $p_n$. They are

           $m\lambda p_0 = \mu p_1,$

(7.20)    $\{(m - n)\lambda + n\mu\} p_n = (m - n + 1)\lambda p_{n-1} + (n + 1)\mu p_{n+1} \qquad (1 \le n < r),$

           $\{(m - n)\lambda + r\mu\} p_n = (m - n + 1)\lambda p_{n-1} + r\mu p_{n+1} \qquad (r \le n \le m).$

From the first equation we get the ratio $p_1/p_0$. From the second
equation we get by induction for n < r

(7.21)    $(n + 1)\mu p_{n+1} = (m - n)\lambda p_n;$

finally, for $n \ge r$ we get from the last equation in (7.20)

(7.22)    $r\mu p_{n+1} = (m - n)\lambda p_n.$

These equations permit calculating successively the ratios $p_n/p_0$.
Finally, $p_0$ follows from the condition $\sum p_k = 1$. The values in Table 3
are obtained in this way.

A comparison of Tables 2 and 3 reveals surprising facts. Note that
both tables refer to the same machines ($\lambda/\mu = 0.1$), but in the second
case we have m = 20 machines and r = 3 repairmen. The number of
machines per repairman has increased from 6 to 6 2/3, but at the same
time, the machines are serviced more efficiently. Let us define a
coefficient of loss for machines by

(7.23)    $\dfrac{w}{m} = \dfrac{\text{average number of machines in waiting line}}{\text{number of machines}}$

and a coefficient of loss for repairmen by

(7.24)    $\dfrac{\rho}{r} = \dfrac{\text{average number of repairmen idle}}{\text{number of repairmen}}.$

For practical purposes we may identify the probabilities $P_n(t)$ with
their limits $p_n$. In Table 3 we have then $w = p_4 + 2p_5 + 3p_6 + \cdots + 17p_{20}$
and $\rho = 3p_0 + 2p_1 + p_2$. Table 4 shows that
for our particular machines (for which $\lambda/\mu = 0.1$) three repairmen per
20 machines are much more economical than one repairman per
6 machines: both coefficients of loss are smaller. Palm's tables referred
to in footnote 16 enable us to find
the most economical ratio of repairmen per machine.


TABLE 3

Probabilities $p_n$ for the Case $\lambda/\mu = 0.1$, m = 20, r = 3

        Machines     Machines     Repairmen
  n     Serviced     Waiting      Idle           p_n
  0        0            0            3         0.13625
  1        1            0            2          .27250
  2        2            0            1          .25888
  3        3            0            0          .15533
  4        3            1            0          .08802
  5        3            2            0          .04694
  6        3            3            0          .02347
  7        3            4            0          .01095
  8        3            5            0          .00475
  9        3            6            0          .00190
 10        3            7            0          .00070
 11        3            8            0          .00023
 12        3            9            0          .00007



TABLE 4

Comparison of Efficiencies of Two Systems Discussed in Examples (c) and (d)

                                         I          II
Number of machines                       6          20
Number of repairmen                      1          3
Machines per repairman                   6          6 2/3
Coefficient of loss for repairmen        0.4845     0.4042
Coefficient of loss for machines         0.0549     0.01694
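
The figures of Tables 3 and 4 follow from the recursions (7.21) and (7.22). The sketch below is illustrative; the function names are ours, and the case r = 1 is treated by the same recursion, since (7.22) then reduces to (7.15).

```python
def repairmen_limits(m, r, rho):
    """Limiting probabilities p_0, ..., p_m for m machines and r repairmen.

    rho stands for the servicing factor lambda/mu.
    """
    ratios = [1.0]
    for n in range(m):
        busy = min(n + 1, r)   # n + 1 in (7.21) for n < r, r in (7.22) for n >= r
        ratios.append(ratios[-1] * (m - n) * rho / busy)
    total = sum(ratios)
    return [x / total for x in ratios]

def loss_coefficients(m, r, rho):
    """Coefficients of loss (7.23) for machines and (7.24) for repairmen."""
    p = repairmen_limits(m, r, rho)
    waiting = sum((n - r) * p[n] for n in range(r + 1, m + 1))
    idle = sum((r - n) * p[n] for n in range(r))
    return waiting / m, idle / r

table3 = repairmen_limits(m=20, r=3, rho=0.1)
system_one = loss_coefficients(m=6, r=1, rho=0.1)
system_two = loss_coefficients(m=20, r=3, rho=0.1)
print([round(p, 5) for p in table3[:5]])
print(system_one, system_two)
```

Calling `loss_coefficients` for other values of m and r gives a quick way of exploring how the two coefficients of loss trade off against each other.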





(e) A Power-supply Problem. 17 One electric circuit supplies a welders
who use the current only intermittently. If at time t a welder uses
current, the probability that he ceases using it before time t + h is
$\mu h + o(h)$; if at time t he requires no current, the probability that he
calls for current before t + h is $\lambda h + o(h)$. The welders work independently
of each other.

We say that the system is in state $E_n$ if n welders are using current.
Thus we have only finitely many states $E_0, \ldots, E_a$.
If the system is in state $E_n$, then $a - n$ welders are not using current
and the probability for a new call for current within time h is
$(a - n)\lambda h + o(h)$; on the other hand, the probability that one of the n
welders ceases using current is $n\mu h + o(h)$. Hence we have a birth
and death process with

(7.25)    $\lambda_n = (a - n)\lambda, \qquad \mu_n = n\mu, \qquad 0 \le n \le a.$

The basic differential equations become

           $P_0'(t) = -a\lambda P_0(t) + \mu P_1(t),$

(7.26)    $P_n'(t) = -\{n\mu + (a - n)\lambda\} P_n(t) + (n + 1)\mu P_{n+1}(t) + (a - n + 1)\lambda P_{n-1}(t),$

           $P_a'(t) = -a\mu P_a(t) + \lambda P_{a-1}(t)$

(with $1 \le n \le a - 1$).

It is easily verified that the limiting probabilities are given by the
binomial distribution

(7.27)    $p_n = \binom{a}{n} \left(\dfrac{\lambda}{\lambda + \mu}\right)^n \left(\dfrac{\mu}{\lambda + \mu}\right)^{a - n},$

a result which could have been anticipated on intuitive grounds.
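
The verification can also be carried out numerically. The sketch below rests on an assumption that should be stated plainly: the binomial distribution is taken with parameter $\lambda/(\lambda + \mu)$, the value that makes the balance relation $(a - n)\lambda p_n = (n + 1)\mu p_{n+1}$ hold. The function names and the numerical values are ours. The code substitutes these $p_n$ into the right sides of the equations for the power-supply process with $\lambda_n = (a - n)\lambda$, $\mu_n = n\mu$ and checks that all of them vanish.

```python
import math

def binomial_limits(a, lam, mu):
    """Binomial probabilities with parameter lam / (lam + mu) (assumed form)."""
    q = lam / (lam + mu)
    return [math.comb(a, n) * q**n * (1 - q)**(a - n) for n in range(a + 1)]

def stationary_residuals(a, lam, mu, p):
    """Right sides of the differential equations with P_n(t) replaced by p_n."""
    residuals = []
    for n in range(a + 1):
        value = -(n * mu + (a - n) * lam) * p[n]
        if n < a:
            value += (n + 1) * mu * p[n + 1]
        if n > 0:
            value += (a - n + 1) * lam * p[n - 1]
        residuals.append(value)
    return residuals

a, lam, mu = 8, 0.4, 1.5            # illustrative values only
p = binomial_limits(a, lam, mu)
worst = max(abs(x) for x in stationary_residuals(a, lam, mu, p))
print(sum(p), worst)
```

The intuitive reason is that each welder, taken alone, alternates between two states and in the long run uses current a fraction $\lambda/(\lambda + \mu)$ of the time.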


8. The Backward (Retrospective) Equations 

In the preceding sections we were studying the probabilities $P_n(t)$
of finding the system at time t in state $E_n$. This notation is convenient
but misleading, inasmuch as it omits mentioning the initial state $E_i$
of the system at time zero. For theoretical purposes it is therefore
more natural to introduce the notation $P_{in}(t)$; this is the probability
that the system is at time t in state $E_n$, given that at time zero it was in $E_i$.
The $P_{in}(t)$ will be called transition probabilities.

17 This example was suggested by the problem treated (inadequately) by H. A. 
Adler and K. W. Miller, A New Approach to Probability Problems in Electrical 
Engineering, Transactions of the American Institute of Electrical Engineers, vol. 65 
(1946), pp. 630-632. 





It must be emphasized that we have been studying these transition
probabilities all along and that nothing is changed but notation. If
the initial state is known to be $E_i$, then $\{P_{in}(t)\}$ is the absolute probability
distribution at time t. If at time zero we have only a probability
distribution $\{q_i\}$ for the initial state, then the probability of $E_n$ at
time t is

(8.1)    $Q_n(t) = \sum_i q_i P_{in}(t).$

In the case of the pure birth process and of the birth and death
process, we found that for an arbitrary fixed i the transition probabilities
$P_{in}(t)$ satisfy the basic differential equations (3.2) and (5.3). The subscript
i appears only in the initial conditions, which should now be
written

(8.2)    $P_{in}(0) = 1$ if n = i, and $P_{in}(0) = 0$ otherwise.


These basic differential equations were derived by prolonging the
time interval (0, t) to (0, t + h) and considering the possible changes
during the short time (t, t + h). We could as well have prolonged the
interval (0, t) in the direction of the past and considered the changes
during $(-h, 0)$. In this way we get a new system of differential
equations in which n (instead of i) remains fixed.

Consider first the case of a pure birth process and let us neglect
events whose probability tends to zero faster than h. If the system
passed from $E_i$ (i > 0) at time $-h$ to $E_n$ at time t, then at time 0 it
must be either at $E_i$ or at $E_{i+1}$. By the method of sections 2 and 3
we conclude that

(8.3)    $P_{in}(t + h) = P_{in}(t)(1 - \lambda_i h) + P_{i+1,n}(t) \lambda_i h + o(h).$

Hence for i > 0 the new basic system now takes the form

(8.4)    $P_{in}'(t) = -\lambda_i P_{in}(t) + \lambda_i P_{i+1,n}(t),$

and

(8.5)    $P_{0n}'(t) = -\lambda_0 P_{0n}(t).$

These equations are called the backward equations, and, for distinction,
(3.2) are called the forward equations. The initial conditions are (8.2).

In the case of the birth and death process, if the system is at time $-h$
in $E_i$, then at time 0 it must be in $E_{i+1}$, $E_i$, or $E_{i-1}$, and the same argument
leads to the backward equations

(8.6)    $P_{in}'(t) = -(\lambda_i + \mu_i) P_{in}(t) + \lambda_i P_{i+1,n}(t) + \mu_i P_{i-1,n}(t).$

These equations correspond to (5.3).

It should be clear that the forward and backward equations are not 
independent of each other: the solution of the backward equations with 
the initial conditions (8.2) automatically satisfies the forward equations. 
These connections are mentioned here only as a preparation for the 
general theory of the next section. 

Example. The Poisson Process. In section 2 we have interpreted
the Poisson expression (1.1) as the probability that exactly n calls
arrive during any time interval of length t. Let us now measure time
from an arbitrary moment, and let us say that the system is in state $E_n$
if exactly n calls arrive up to time t. Then a transition from $E_i$ at $t_1$
to $E_n$ at $t_2$ means that $n - i$ calls arrived during $(t_1, t_2)$. This is
possible only if $n \ge i$, and hence we have for the transition probabilities
of the Poisson process

(8.7)    $P_{in}(t) = e^{-\lambda t} \dfrac{(\lambda t)^{n-i}}{(n - i)!}$ if $n \ge i$,

          $P_{in}(t) = 0$ if n < i.

The forward and backward equations are, respectively,

(8.8)    $P_{in}'(t) = -\lambda P_{in}(t) + \lambda P_{i,n-1}(t)$

and

(8.9)    $P_{in}'(t) = -\lambda P_{in}(t) + \lambda P_{i+1,n}(t),$

and it is easily verified that (8.7) is a solution of either system, and
satisfies the initial condition (8.2).
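
For readers who prefer a numerical check to the differentiation, the following sketch compares a central difference quotient of the Poisson transition probabilities with the right sides of the forward and of the backward equations. The function names, the step `eps`, and the values of λ and t are our own choices for illustration.

```python
import math

def poisson_transition(i, n, t, lam):
    """Transition probability from E_i to E_n in time t for the Poisson process."""
    if n < i:
        return 0.0
    k = n - i
    return math.exp(-lam * t) * (lam * t)**k / math.factorial(k)

def derivative(i, n, t, lam, eps=1e-6):
    """Central difference approximation to the derivative with respect to t."""
    ahead = poisson_transition(i, n, t + eps, lam)
    behind = poisson_transition(i, n, t - eps, lam)
    return (ahead - behind) / (2 * eps)

lam, t = 1.7, 0.9                   # illustrative values only
worst = 0.0
for i in range(4):
    for n in range(7):
        lhs = derivative(i, n, t, lam)
        here = poisson_transition(i, n, t, lam)
        forward = -lam * here + lam * poisson_transition(i, n - 1, t, lam)
        backward = -lam * here + lam * poisson_transition(i + 1, n, t, lam)
        worst = max(worst, abs(lhs - forward), abs(lhs - backward))
print(worst)
```

That both systems are satisfied reflects the fact that the transition probabilities depend on i and n only through the difference $n - i$.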

9. Generalization; the Kolmogorov Equations 
Up to now we considered exclusively processes in which direct transitions
from a state $E_n$ were possible only to the neighboring states
$E_{n+1}$ and $E_{n-1}$. Moreover, the processes were time-homogeneous,
that is to say, the transition probabilities $P_{in}(t)$ were the same for all
time intervals of length t. We now consider more general processes
in which both assumptions are dropped.

As in the theory of ordinary Markov chains (Chapter 15), we shall
permit direct transitions from any state $E_i$ to any state $E_n$. The
transition probabilities are permitted to vary in time. This necessitates
stating both end points of any time interval instead of noting only its
length. Accordingly, we shall write $P_{in}(\tau, t)$ for the conditional probability
of finding the system at time t in state $E_n$, given that at a previous
instant $\tau$ the state was $E_i$. The symbol $P_{in}(\tau, t)$ is meaningless unless
$\tau < t$. If the process is homogeneous in time, then $P_{in}(\tau, t)$ depends
only on the difference $t - \tau$, and we can write $P_{in}(t)$ instead of
$P_{in}(\tau, \tau + t)$ (which is then independent of $\tau$).

The principal property of our processes is the Markov property
discussed in Chapter 15, section 10. It states that, given the state of
the system at any time, future changes are independent of the past.
More precisely, consider three moments $\tau < s < t$ and suppose that
at time $\tau$ the system is in state $E_i$ and at time s in state $E_\nu$. For a
general process the (conditional) probability of finding the system at
time t in state $E_n$ depends on both i and $\nu$; in other words, not only
the “present state” $E_\nu$, but also the past state $E_i$, has an influence on
the state at time t. However, for a Markov process this is not so.
For it the considered probability equals $P_{\nu n}(s, t)$, the probability of a
transition from $E_\nu$ at time s to $E_n$ at time t; the knowledge that at time
$\tau < s$ the system was in state $E_i$ permits no inference as to the future.
This assumption leads directly to an important conclusion. The
passage from $E_i$ at time $\tau$ to $E_n$ at time t must occur via some state $E_\nu$
at time s, and for a Markov process the probability that the passage
goes via a particular state $E_\nu$ is $P_{i\nu}(\tau, s) P_{\nu n}(s, t)$. It follows that we
must have

(9.1)    $P_{in}(\tau, t) = \sum_\nu P_{i\nu}(\tau, s) P_{\nu n}(s, t)$

identically for all $\tau < s < t$. This is the Chapman-Kolmogorov equation.
It is the counterpart to the equation (10.3) of Chapter 15, which is valid
when the time parameter assumes integral values only.

It was shown in Chapter 15, section 10, that the Chapman-Kolmogorov
equation does not hold for all stochastic processes. For our
purposes we could take (9.1) as defining the class of processes with which
we are concerned. 18 In fact, we shall add only regularity restrictions
and derive our basic differential equations from (9.1). There is a
probabilistic background leading up to the Chapman-Kolmogorov
equation, but we need not refer to it: once (9.1) is given we can easily
derive differential equations which determine the probabilities $P_{in}(\tau, t)$,
and can proceed in a purely analytical way.

18 The question of whether the Kolmogorov equation characterizes Markov processes
poses difficult problems requiring the study of the actual sample functions
X(t). It should be borne in mind that we are using a short-cut to obtain differential
equations for certain probabilities and are not analyzing the process in all
its aspects.

In the case of time-homogeneous processes, (9.1) assumes the simpler
form

(9.2)    $P_{in}(t + s) = \sum_\nu P_{i\nu}(t) P_{\nu n}(s).$

For the Poisson process this equation reduces to the convolution
property of the Poisson distribution stated in Chapter 11.

We now introduce our fundamental regularity conditions which in
an obvious way generalize the starting assumptions of the birth and
death process.

Assumption 1. To every state $E_n$ there corresponds a continuous
function $c_n(t) \ge 0$ such that as $h \to 0$

(9.3)    $\dfrac{1 - P_{nn}(t, t + h)}{h} \to c_n(t)$

uniformly in t.

The probabilistic interpretation of (9.3) is obvious: if at time t the
system is in state $E_n$, then the probability that during (t, t + h) a
change occurs is $c_n(t) h + o(h)$. Analytically, (9.3) requires that
$P_{nn}(t, s) \to 1$ as $s \to t$, and that $P_{nn}(t, x)$ has at x = t a derivative
with respect to x. The function $c_n(t)$ plays the role of $\lambda_n + \mu_n$ in the
birth and death process. In the case of a time-homogeneous process,
$c_n$ is a constant.


Assumption 2. To every pair of states $E_j$, $E_k$ with $j \ne k$ there correspond
transition probabilities $p_{jk}(t)$ (depending on time) such that as
$h \to 0$

(9.4)    $\dfrac{P_{jk}(t, t + h)}{h} \to c_j(t) p_{jk}(t) \qquad (j \ne k)$

uniformly in t. The $p_{jk}(t)$ are continuous in t, and for every fixed t, j

(9.5)    $\sum_k p_{jk}(t) = 1, \qquad p_{jj}(t) = 0.$

Here $p_{jk}(t)$ can be interpreted as the conditional probability that,
if a change from $E_j$ occurs during (t, t + h), this change takes the
system from $E_j$ to $E_k$. In the birth and death process

(9.6)    $p_{j,j+1}(t) = \dfrac{\lambda_j}{\lambda_j + \mu_j}, \qquad p_{j,j-1}(t) = \dfrac{\mu_j}{\lambda_j + \mu_j},$

while $p_{jk}(t) = 0$ for all other combinations of j and k. For every fixed
t the $p_{jk}(t)$ can be interpreted as transition probabilities of a Markov
chain.

The two assumptions suffice to derive a system of backward equations
for the $P_{jk}(\tau, t)$, but for the forward equations we require in
addition

Assumption 3. For fixed k the passage to the limit in (9.4) is uniform
with respect to j.

The necessity of this assumption is of considerable interest for the 
theory of infinite systems of differential equations and will be discussed 
in the next section. 

We now derive a system of differential equations for the $P_{ik}(\tau, t)$
as functions of t and k (forward equations). From (9.1) we have

(9.7)    $P_{ik}(\tau, t + h) = \sum_j P_{ij}(\tau, t) P_{jk}(t, t + h).$

If we express the term $P_{kk}(t, t + h)$ on the right in accordance with
(9.3), we get

(9.8)    $\dfrac{P_{ik}(\tau, t + h) - P_{ik}(\tau, t)}{h} = -c_k(t) P_{ik}(\tau, t) + \dfrac{1}{h} \sum_{j \ne k} P_{ij}(\tau, t) P_{jk}(t, t + h) + \cdots,$

where the neglected terms tend to 0 with h, and the sum extends over
all j except j = k. We can now apply (9.4) to the terms of the sums.
Since (by assumption 3) the passage to the limit is uniform in j, the
right side has a limit. Hence also the left side has a limit, which means
that $P_{ik}(\tau, t)$ has a partial derivative with respect to t, and

(9.9)    $\dfrac{\partial P_{ik}(\tau, t)}{\partial t} = -c_k(t) P_{ik}(\tau, t) + \sum_j P_{ij}(\tau, t) c_j(t) p_{jk}(t).$

This is the basic system of forward differential equations. Note that,
in it, i and $\tau$ are fixed so that we have (despite the formal appearance
of a partial derivative) a system of ordinary differential equations for
the infinite system of functions $P_{ik}(\tau, t)$, k = 0, 1, 2, .... The parameters
i and $\tau$ appear only in the initial condition

(9.10)    $P_{ik}(\tau, \tau) = 1$ if k = i, and $P_{ik}(\tau, \tau) = 0$ otherwise.



In like manner we can derive a system of backward equations,
starting from

(9.11)    $P_{ik}(\tau - h, t) = \sum_\nu P_{i\nu}(\tau - h, \tau) P_{\nu k}(\tau, t)$

and applying our assumptions to the $P_{i\nu}(\tau - h, \tau)$. We get

(9.12)    $\dfrac{P_{ik}(\tau - h, t) - P_{ik}(\tau, t)}{h} = -c_i(\tau) P_{ik}(\tau, t) + \dfrac{1}{h} \sum_{\nu \ne i} P_{i\nu}(\tau - h, \tau) P_{\nu k}(\tau, t) + \dfrac{o(h)}{h}.$

Here $P_{i\nu}(\tau - h, \tau)/h \to c_i(\tau) p_{i\nu}(\tau)$, and the passage to the limit is
always uniform, since (without using assumption 3) by (9.4) and (9.3)

(9.13)    $\dfrac{1}{h} \sum_{\nu \ne i} P_{i\nu}(t, t + h) = \dfrac{1 - P_{ii}(t, t + h)}{h} \to c_i(t) = \sum_\nu c_i(t) p_{i\nu}(t).$

[This means that (9.4) may be summed over k.] It follows then from
(9.12) that

(9.14)    $\dfrac{\partial P_{ik}(\tau, t)}{\partial \tau} = c_i(\tau) P_{ik}(\tau, t) - c_i(\tau) \sum_\nu p_{i\nu}(\tau) P_{\nu k}(\tau, t).$

This is the basic system of backward differential equations. In it k
and t are fixed, and we have the initial condition

(9.15)    $P_{ik}(t, t) = 1$ if i = k, and $P_{ik}(t, t) = 0$ otherwise.

The two systems of differential equations were first derived by
A. Kolmogorov in a celebrated paper 19 in which he laid the foundations
of the theory of Markov processes (of more general types than here
considered). It can be shown 20 that each of the two systems uniquely
determines a system of transition probabilities $P_{jk}(\tau, t)$ satisfying all
our conditions, including the Chapman-Kolmogorov equation (9.1).
We know from the case of the pure birth process (section 4) that the
$P_{jk}(\tau, t)$ are not always a proper probability distribution, but that
sometimes

(9.16)    $\sum_k P_{jk}(\tau, t) < 1,$

where the difference between the two sides accounts for the possibility
of infinitely many transitions in a finite time interval. From
the point of view of applications the possibility (9.16) can be safely
disregarded, but it is of interest both for the theory of stochastic processes
and for the theory of infinite systems of differential equations.

19 Über die analytischen Methoden in der Wahrscheinlichkeitsrechnung, Mathematische
Annalen, vol. 104 (1931), pp. 415-458.

20 Cf. W. Feller, On the Integrodifferential Equations of Purely Discontinuous
Markoff Processes, Transactions of the American Mathematical Society, vol. 48
(1940), pp. 488-515. There also necessary and sufficient conditions for (9.16) are
given. Unfortunately, the paper treats a more general class of processes, so that
our simple differential equations are replaced by much more complicated integrodifferential
equations. Previously Kolmogorov gave a partial existence proof, but
under very restrictive conditions.

Example. The Compound Poisson Process. Consider the case where
all $c_i(t)$ equal the same constant

(9.17)    $c_i(t) = \lambda$

and where the $p_{jk}$ are independent of t. In this case they define an
ordinary Markov chain, and we denote (as in Chapter 15) its higher
transition probabilities by $p_{jk}^{(n)}$.

From (9.17) it follows that the probability of a transition occurring
during (t, t + h) is independent of the state of the system at time t
and equals $\lambda h + o(h)$. Accordingly the number of transitions within
the time interval $(\tau, t)$ has a Poisson distribution with parameter
$\lambda(t - \tau)$. The conditional transition probabilities under the assumption
that there are n transitions are given by $p_{jk}^{(n)}$. Hence we have

(9.18)    $P_{jk}(\tau, t) = e^{-\lambda(t - \tau)} \sum_{n=0}^{\infty} \dfrac{\lambda^n (t - \tau)^n}{n!}\, p_{jk}^{(n)}$

[where $p_{jk}^{(0)}$ equals 1 or 0 according to whether k = j or $k \ne j$.] It is
easily verified that (9.18) is in fact a solution of our two systems of
differential equations.
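
A small numerical experiment illustrates the compound Poisson formula. The sketch below is not taken from the text: the three-state chain is a hypothetical example chosen so that its diagonal vanishes and its rows add to unity, the series over n is cut off after `terms` summands, and the function names are ours. The code checks that the resulting transition probabilities form a proper distribution and satisfy the Chapman-Kolmogorov equation for time-homogeneous processes.

```python
import math

def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def compound_poisson(chain, lam, elapsed, terms=80):
    """Poisson mixture of the powers of `chain`, truncated after `terms` terms."""
    size = len(chain)
    power = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    weight = math.exp(-lam * elapsed)          # Poisson weight for n = 0
    result = [[0.0] * size for _ in range(size)]
    for n in range(terms):
        for i in range(size):
            for j in range(size):
                result[i][j] += weight * power[i][j]
        power = mat_mul(power, chain)          # n-step probabilities of the chain
        weight *= lam * elapsed / (n + 1)      # next Poisson weight
    return result

# hypothetical chain with zero diagonal and unit row sums
chain = [[0.0, 0.7, 0.3],
         [0.5, 0.0, 0.5],
         [0.2, 0.8, 0.0]]
lam = 1.3
first = compound_poisson(chain, lam, 0.4)
second = compound_poisson(chain, lam, 0.6)
whole = compound_poisson(chain, lam, 1.0)
product = mat_mul(first, second)
row_sums = [sum(row) for row in whole]
ck_gap = max(abs(product[i][j] - whole[i][j]) for i in range(3) for j in range(3))
print(row_sums, ck_gap)
```

With a finite number of states no probability can escape, so the row sums equal unity up to the negligible truncation error.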

* 10. Degenerate Processes 

The theorems concerning Kolmogorov's two systems of differential
equations round off the theory and are satisfactory except for two
somewhat mystifying points. First, the possibility of the inequality
(9.16) is disturbing, since in this case the transition probabilities do
not form a proper probability distribution. Second, the derivation
of the forward equations required an assumption which was not
necessary for the backward equations. Doob 21 discovered that the
two facts are related to each other and connected with the existence
of a certain type of degenerate process. This discovery is of theoretical
interest and also reveals new facts concerning infinite systems of
ordinary differential equations. The situation is particularly simple
and intuitive in the case of a pure birth process, and a detailed study
of this case will contribute to an understanding of the general theory.

* Starred sections treat special topics and may be omitted at first reading.

We are concerned with the pure birth process of sections 3 and 4.
The states are $E_0, E_1, \ldots$, and direct transitions are possible only
from $E_n$ to $E_{n+1}$. The process is homogeneous in time, so that the
transition probabilities $P_{ik}(t)$ depend only on the duration t of the time
interval, not on its position. For fixed i the $P_{ik}(t)$ satisfy the forward
equations

(10.1)    $P_{ik}'(t) = -\lambda_k P_{ik}(t) + \lambda_{k-1} P_{i,k-1}(t)$

(where we put $\lambda_{-1} = 0$); for fixed k we have the backward equations

(10.2)    $P_{ik}'(t) = -\lambda_i P_{ik}(t) + \lambda_i P_{i+1,k}(t).$

We know from section 4 that the condition

(10.3)    $\sum \dfrac{1}{\lambda_k} < \infty$

is necessary and sufficient for the degenerate case to occur, that is, that

(10.4)    $\sum_k P_{ik}(t) < 1$

(at least for large t). The difference of the two sides in (10.4) may be 
interpreted as the probability that infinitely many transitions occur in a 
finite time. In this case the state of the system has, so to speak, moved 
out to infinity. From a “practical” point of view it stays at infinity 
forever after, and this concludes the story. However, Doob remarked 
that the pure mathematician may introduce the assumption that, 
whenever the state has moved out to infinity, it instantaneously 
returns to E 0 (or some other state). The process then continues and 
has all the essential properties of our processes except that we have a 
new type of path. If the system is at time 0 at E 0 and at time t at E&, 
it may have undergone six transitions or infinitely many, moving one 
or more times out to infinity and starting each time afresh at E 0 . The 
transition probabilities of this new process have a curious property. 
They obviously satisfy assumptions 1 and 2 of the preceding section, 

** J. L. Doob, Markoff Chains—Denumerable Case, Transactions of the American 
Mathematical Society, vol. 58 (1945), 455-473. 



17.10] 


DEGENERATE PROCESSES 


393 


and therefore they satisfy the backward equations.²² They cannot
satisfy the forward equations, since the solution of this system is
unique.²³ This explains why a special assumption is required for the
derivation of the forward equations: this assumption eliminates the
possibility of a state $E_k$ being reached via infinity. If this assumption
is dropped, we cannot derive (10.1), but our method leads to (10.1)
with the equality sign replaced by the sign $\ge$. The same statement
holds in the general case. The backward equations are always satisfied;
however, in the forward equations the equality sign holds only when
$P_{ik}(t)$ is interpreted as probability of a transition from $E_i$ to $E_k$ in finitely
many steps. If also transitions “via infinity” are possible, then the
equality sign in the forward equations must be replaced by the sign $\ge$.
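The degenerate case can be made tangible with a small simulation. The sketch below is only an illustration under stated assumptions: the rates $\lambda_k = (k+1)^2$ are a hypothetical choice satisfying (10.3), the chain starts at $E_0$, and the infinite sum of sojourn times is truncated after 500 states, which slightly overestimates the defect. The fraction of runs that have passed all these states by time $t$ estimates the difference of the two sides in (10.4).

```python
import math
import random

def explosion_time(rng, n_states=500):
    """Time spent in E_0, ..., E_{n_states-1} when lambda_k = (k + 1)**2.

    The sojourn time in E_k is exponential with rate lambda_k; the
    truncation at n_states is an assumption of this sketch.
    """
    return sum(rng.expovariate((k + 1) ** 2) for k in range(n_states))

def defect_estimate(t, runs=2000, seed=1):
    """Return (mean explosion time, estimated probability of explosion by time t)."""
    rng = random.Random(seed)
    times = [explosion_time(rng) for _ in range(runs)]
    mean_time = sum(times) / runs
    defect = sum(1 for x in times if x <= t) / runs
    return mean_time, defect

mean_time, defect = defect_estimate(t=1.5)
print("mean explosion time:", mean_time, "(sum of 1/lambda_k is about", math.pi ** 2 / 6, ")")
print("estimated 1 - sum_k P_0k(1.5):", defect)
```

With rates such as $\lambda_k = k + 1$, for which the series in (10.3) diverges, the truncated sum of sojourn times grows without bound as more states are included, and no defect appears.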

It must be understood that this discussion concerns only a very
exceptional (and in some respects artificial) case. In general,²⁴ the
common solution $\{P_{ik}(\tau, t)\}$ of the forward and backward equations
satisfies, for every fixed $t$, the natural condition $\sum_k P_{ik}(\tau, t) = 1$. In this
case the solution is unique, and no wandering out to infinity can occur.

For the theory of infinite systems of differential equations our discussion
supplies interesting examples where the solution of a system is not
unique. In the simple case of (10.2) we get the following

Theorem. If (10.3) holds, then the infinite system of differential
equations

(10.5)    $y_i' = -\lambda_i y_i + \lambda_i y_{i+1}$

has a non-zero solution with $y_i(0) = 0$ for all $i$ and for which

(10.6)    $y_i(t) < 1, \qquad y_i(t) \ge 0$

for all $t$.

We conclude with a purely analytical proof of this theorem and indications of 
how the theorems of the preceding section can be proved in the special case of the 
pure birth process. 

The solutions $P_{ik}(t)$ of (10.1) (where $i$ is fixed) satisfy the initial condition

(10.7)    $P_{ii}(0) = 1, \qquad P_{ik}(0) = 0 \quad \text{for } k \ne i.$

22 Note that this implies that the backward equations have several systems of
solutions satisfying the same initial conditions.

23 Equations (10.1) for $k = 0, 1, \ldots, i - 1$ form a system of $i$ linear equations
with initial values $P_{ik}(0) = 0$. Hence $P_{ik}(t) = 0$ for $k < i$ by familiar theorems
on finite systems. For $k = i$ we get $P_{ii}'(t) = -\lambda_i P_{ii}(t)$, and all $P_{ik}(t)$ with $k > i$
can be calculated recursively. By contrast, (10.2) is not a recursive system since
the $i$th equation involves the next higher term $P_{i+1,k}$.

24 Cf. the Transactions paper cited in section 9, where the necessary and sufficient
conditions for the occurrence of (10.4) are given.




It has been noted before that the equations can be solved recursively. It is shown
in elementary textbooks (and it is easily verified) that the solution is

(10.8)    $P_{ik}(t) = 0 \quad (k < i),$
          $P_{ii}(t) = e^{-\lambda_i t},$
          $P_{ik}(t) = \lambda_{k-1} \int_0^t e^{-\lambda_k (t - \tau)} P_{i,k-1}(\tau)\, d\tau \quad (k > i).$

We prove first that these $P_{ik}(t)$ satisfy the backward equations (10.2). This
assertion is equivalent to

(10.9)    $P_{ik}(t) = \lambda_i \int_0^t e^{-\lambda_i (t - \tau)} P_{i+1,k}(\tau)\, d\tau \quad (k > i).$

This relation is readily verified for $k = i + 1$, and the general proof then proceeds
by induction. Suppose that (10.9) is correct for $k < i + r$, and let $k = i + r$.
Then the term $P_{i,k-1}(\tau)$ occurring under the integral in (10.8) can be expressed by
means of (10.9) (where $k$, $t$, $\tau$ are to be replaced by $k - 1$, $\tau$, $x$ respectively). If
in the resulting double integral the order of integrations is reversed, we get, putting
$\tau - x = y$,

(10.10)    $P_{ik}(t) = \lambda_i \int_0^t e^{-\lambda_i x}\, dx \int_0^{t-x} \lambda_{k-1} e^{-\lambda_k (t - x - y)} P_{i+1,k-1}(y)\, dy.$

The inner integral equals $P_{i+1,k}(t - x)$, and thus (10.10) reduces to (10.9) with $\tau$
replaced by $t - x$. This proves (10.9).
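A numerical spot check of this statement is easy when the rates are pairwise distinct. The sketch below relies on the standard partial-fraction closed form for the recursion (10.8), which is not written out in the text and is stated here as an assumption: $P_{ik}(t) = \bigl(\prod_{j=i}^{k-1}\lambda_j\bigr)\sum_{j=i}^{k} e^{-\lambda_j t} / \prod_{l \ne j}(\lambda_l - \lambda_j)$. The rates and the point $(i, k, t)$ are hypothetical. A central difference approximates the derivative, which is then compared with the right sides of (10.1) and (10.2).

```python
import math

def birth_prob(i, k, t, lam):
    """P_ik(t) for a pure birth process with pairwise distinct rates lam[i..k]."""
    if k < i:
        return 0.0
    total = 0.0
    for j in range(i, k + 1):
        denom = 1.0
        for l in range(i, k + 1):
            if l != j:
                denom *= lam[l] - lam[j]
        total += math.exp(-lam[j] * t) / denom
    return math.prod(lam[i:k]) * total

def derivative(i, k, t, lam, h=1e-5):
    """Central-difference approximation of dP_ik/dt."""
    return (birth_prob(i, k, t + h, lam) - birth_prob(i, k, t - h, lam)) / (2 * h)

lam = [(n + 1) ** 2 for n in range(8)]   # hypothetical rates, all distinct
i, k, t = 1, 4, 0.3

lhs = derivative(i, k, t, lam)
# Right side of the forward equation (10.1).
forward = -lam[k] * birth_prob(i, k, t, lam) + lam[k - 1] * birth_prob(i, k - 1, t, lam)
# Right side of the backward equation (10.2).
backward = -lam[i] * birth_prob(i, k, t, lam) + lam[i] * birth_prob(i + 1, k, t, lam)
print(lhs, forward, backward)
```

The three printed numbers should agree up to the discretization error of the difference quotient.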

Now put

(10.11)    $L_i(t) = 1 - \sum_{k=i}^{\infty} P_{ik}(t).$

From section 4 we know that (10.3) implies $L_i(t) > 0$, at least for large $t$. Hence
$L_i(t)$ does not vanish identically, and $L_i(0) = 0$. We now show that $L_i(t)$ satisfies
the equations (10.5). For that purpose we introduce (10.9) into (10.11) and remember
that $P_{ii}(t) = e^{-\lambda_i t}$. We see then that

(10.12)    $L_i(t) = \lambda_i \int_0^t e^{-\lambda_i (t - \tau)} L_{i+1}(\tau)\, d\tau,$

and by differentiation we find that the $L_i(t)$ satisfy (10.5).

11. Problems for Solution 

1. In the pure birth process defined by (3.2) let $\lambda_n > 0$ for all $n$. Prove that for
every fixed $n \ge 1$ the function $P_n(t)$ first increases, then decreases to 0. If $t_n$ is the
place of the maximum, then $t_1 < t_2 < t_3 < \cdots$.

Hint: Use induction; differentiate (3.2).

2. Continuation. Show that $t_n \to \infty$.

Hint: If $t_n \to \tau$, then for fixed $t > \tau$ the sequence $\lambda_n P_n(t)$ increases. Use (4.10).

3. The Yule process. Derive the mean and the variance of the distribution
defined by (3.4). [Use only the differential equations, not the explicit form (3.5).]

4. Pure death process. Find the differential equations of a process of the Yule
type with transitions only from $E_n$ to $E_{n-1}$. Find the distribution $P_n(t)$, its mean,
and its variance, assuming that the initial state is $i$.





5. The Polya process.²⁵ This is a non-stationary pure birth process with $\lambda_n$
depending on time:

(11.1)    $\lambda_n(t) = \frac{1 + an}{1 + at}.$

Show that the solution with initial condition $P_0(0) = 1$ is

(11.2)    $P_0(t) = (1 + at)^{-1/a},$
          $P_n(t) = \frac{(1 + a)(1 + 2a) \cdots \{1 + (n - 1)a\}}{n!}\, t^n (1 + at)^{-n - 1/a}.$

Show from the differential equations that the mean and variance are $t$ and $t(1 + at)$,
respectively.
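Before attempting the proof, the claimed moments can be sanity-checked numerically. The sketch below is not a solution of the problem: it builds the Polya distribution from $P_0(t) = (1 + at)^{-1/a}$ using the term ratio $P_n/P_{n-1} = t\{1 + (n-1)a\}/\{n(1 + at)\}$, which is what the product form of $P_n(t)$ implies, and compares the sums with $1$, $t$ and $t(1 + at)$. The values of `t`, `a` and the truncation `n_terms` are hypothetical.

```python
def polya_distribution(t, a, n_terms=2000):
    """Return [P_0(t), P_1(t), ...] for the Polya process, truncated at n_terms."""
    p = (1.0 + a * t) ** (-1.0 / a)   # P_0(t)
    probs = [p]
    for n in range(1, n_terms):
        # Ratio of consecutive terms P_n(t) / P_{n-1}(t).
        p *= t * (1.0 + (n - 1) * a) / (n * (1.0 + a * t))
        probs.append(p)
    return probs

t, a = 2.0, 0.5
probs = polya_distribution(t, a)
total = sum(probs)
mean = sum(n * p for n, p in enumerate(probs))
var = sum(n * n * p for n, p in enumerate(probs)) - mean ** 2
print(total, mean, var)   # to be compared with 1, t, t * (1 + a * t)
```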

6. Continuation. The Polya process can be obtained by a passage to the limit
from the Polya urn scheme of Chapter 5, example (2.6). If the state of the system
is defined as the number of red balls, then the transition probability $E_k \to E_{k+1}$
at the $(n + 1)$st drawing is

(11.3)    $p_{k,n} = \frac{r + kc}{r + b + nc} = \frac{p + k\gamma}{1 + n\gamma},$

where $p = r/(r + b)$, $\gamma = c/(r + b)$.

As in the passage from Bernoulli trials to the Poisson distribution, let drawings
be made at the rate of one in time $h$ and let $h \to 0$, $n \to \infty$ so that $np \to t$,
$n\gamma \to at$. Show that in the limit (11.3) leads to (11.1). Show also that the Polya
distribution (2.5) of Chapter 5 passes into (11.2).

7. Linear growth. If in the process defined by (5.8) $\lambda = \mu$, and $P_1(0) = 1$, then

(11.4)    $P_0(t) = \frac{\lambda t}{1 + \lambda t}, \qquad P_n(t) = \frac{(\lambda t)^{n-1}}{(1 + \lambda t)^{n+1}}.$

The probability of ultimate extinction is 1.

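8. Continuation. Assuming a trial solution to (5.8) of the form $P_n(t) = A(t)B^n(t)$,
prove that the solution with $P_1(0) = 1$ is

(11.5)    $P_0(t) = \mu B(t), \qquad P_n(t) = \{1 - \lambda B(t)\}\{1 - \mu B(t)\}\{\lambda B(t)\}^{n-1}$

with

(11.6)    $B(t) = \frac{1 - e^{(\lambda - \mu)t}}{\mu - \lambda e^{(\lambda - \mu)t}}.$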
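The following sketch is a numerical plausibility check, not a proof. It assumes the readings $B(t) = (1 - e^{(\lambda-\mu)t})/(\mu - \lambda e^{(\lambda-\mu)t})$, $P_0 = \mu B$ and $P_n = (1 - \lambda B)(1 - \mu B)(\lambda B)^{n-1}$ for (11.5) and (11.6), and verifies that the probabilities add to one, that the mean equals $e^{(\lambda-\mu)t}$, and that for $\mu$ very close to $\lambda$ the values approach $\lambda t/(1 + \lambda t)$ and $(1 + \lambda t)^{-2}$ as in problem 7. All parameter values are hypothetical.

```python
import math

def B(t, lam, mu):
    """The auxiliary function B(t); requires lam != mu."""
    e = math.exp((lam - mu) * t)
    return (1.0 - e) / (mu - lam * e)

def linear_growth(n, t, lam, mu):
    """P_n(t) for the linear growth process started with one individual."""
    b = B(t, lam, mu)
    if n == 0:
        return mu * b
    return (1.0 - lam * b) * (1.0 - mu * b) * (lam * b) ** (n - 1)

lam, mu, t = 1.0, 1.5, 0.8
total = sum(linear_growth(n, t, lam, mu) for n in range(400))
mean = sum(n * linear_growth(n, t, lam, mu) for n in range(400))

# Nearly critical case: mu slightly above lam, compared with the case lam = mu.
p0_near = linear_growth(0, t, 1.0, 1.0 + 1e-6)
p1_near = linear_growth(1, t, 1.0, 1.0 + 1e-6)
print(total, mean, math.exp((lam - mu) * t))
print(p0_near, t / (1 + t), p1_near, 1 / (1 + t) ** 2)
```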


9. Continuation. The generating function $P(s, t) = \sum P_n(t)s^n$ satisfies the partial
differential equation

(11.7)    $\frac{\partial P}{\partial t} = \{\mu - (\lambda + \mu)s + \lambda s^2\}\frac{\partial P}{\partial s}.$

10. Continuation. Let $M_2(t) = \sum n^2 P_n(t)$ and $M(t) = \sum nP_n(t)$ (as in section 5).
Show that

(11.8)    $M_2'(t) = 2(\lambda - \mu)M_2(t) + (\lambda + \mu)M(t).$

25 O. Lundberg, On Random Processes and Their Applications to Sickness and 
Accident Statistics , Uppsala, 1940. In this book many properties of the Polya 
process are discussed mainly in relation to compound Poisson processes. 




Deduce that when $\lambda > \mu$ the variance of $\{P_n\}$ is given by

(11.9)    $e^{2(\lambda - \mu)t}\{1 - e^{(\mu - \lambda)t}\}(\lambda + \mu)/(\lambda - \mu).$
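As a rough cross-check of the variance formula, one can integrate the moment equations numerically. The sketch below assumes, as recalled from section 5, that the mean satisfies $M'(t) = (\lambda - \mu)M(t)$, combines it with the equation for $M_2$ from problem 10, and starts from $M(0) = M_2(0) = 1$ (one individual at time 0). A classical fourth-order Runge-Kutta step is used; the step count and the parameter values are hypothetical.

```python
import math

def moments(t, lam, mu, steps=2000):
    """Integrate M' = (lam - mu) M and M2' = 2 (lam - mu) M2 + (lam + mu) M by RK4."""
    def rhs(m, m2):
        return (lam - mu) * m, 2 * (lam - mu) * m2 + (lam + mu) * m

    h = t / steps
    m, m2 = 1.0, 1.0
    for _ in range(steps):
        k1 = rhs(m, m2)
        k2 = rhs(m + h / 2 * k1[0], m2 + h / 2 * k1[1])
        k3 = rhs(m + h / 2 * k2[0], m2 + h / 2 * k2[1])
        k4 = rhs(m + h * k3[0], m2 + h * k3[1])
        m += h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
        m2 += h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
    return m, m2

lam, mu, t = 1.5, 1.0, 0.8
m, m2 = moments(t, lam, mu)
variance_ode = m2 - m * m
variance_formula = (math.exp(2 * (lam - mu) * t)
                    * (1 - math.exp((mu - lam) * t)) * (lam + mu) / (lam - mu))
print(variance_ode, variance_formula)
```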

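11. For the process defined by (7.2) the generating function $P(s, t) = \sum P_n(t)s^n$
satisfies the partial differential equation

(11.10)    $\frac{\partial P}{\partial t} = (1 - s)\left\{-\lambda P + \mu \frac{\partial P}{\partial s}\right\}.$

Its solution is

$P = e^{-\lambda(1 - s)(1 - e^{-\mu t})/\mu}\,\{1 - (1 - s)e^{-\mu t}\}^i.$

For $i = 0$ this is a Poisson distribution with parameter $\lambda(1 - e^{-\mu t})/\mu$. As $t \to \infty$,
the distribution $\{P_n(t)\}$ tends to a Poisson distribution with parameter $\lambda/\mu$.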
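The stated solution can be tested against the differential equation by finite differences. The sketch below assumes that the equation has the form $\partial P/\partial t = (1 - s)\{-\lambda P + \mu\, \partial P/\partial s\}$ and that the solution is $e^{-\lambda(1-s)(1-e^{-\mu t})/\mu}\{1 - (1-s)e^{-\mu t}\}^i$; the evaluation point, the step `h` and the parameters are hypothetical. It also compares the solution for large $t$ with the Poisson generating function $e^{-(\lambda/\mu)(1-s)}$.

```python
import math

def gen_fn(s, t, lam, mu, i):
    """Generating function P(s, t) for initial state i (assumed form)."""
    decay = math.exp(-mu * t)
    return math.exp(-lam * (1 - s) * (1 - decay) / mu) * (1 - (1 - s) * decay) ** i

lam, mu, i = 2.0, 0.5, 3
s, t, h = 0.4, 1.2, 1e-5

# Central differences for the two partial derivatives.
dP_dt = (gen_fn(s, t + h, lam, mu, i) - gen_fn(s, t - h, lam, mu, i)) / (2 * h)
dP_ds = (gen_fn(s + h, t, lam, mu, i) - gen_fn(s - h, t, lam, mu, i)) / (2 * h)
residual = dP_dt - (1 - s) * (-lam * gen_fn(s, t, lam, mu, i) + mu * dP_ds)

# Limit distribution: Poisson with parameter lam / mu.
poisson_limit = math.exp(-(lam / mu) * (1 - s))
print(residual, gen_fn(s, 60.0, lam, mu, i), poisson_limit)
```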

12. For the process defined by (7.26) the generating function $P(s, t) = \sum P_n(t)s^n$
satisfies the partial differential equation

$(\mu + \lambda s)\frac{\partial P}{\partial s} = a\lambda P,$

with the solution $P = \{(\mu + \lambda s)/(\lambda + \mu)\}^a$.

13. Show that the transition probabilities of the pure birth process and of the 
birth and death process satisfy the Chapman-Kolmogorov equations. 

14. Consider a stationary process with finitely many states, that is, suppose that
the system of differential equations (9.9) is finite and that the coefficients $c_j$ and
$p_{ik}$ are constants. Prove that the solutions are linear combinations of exponential
terms $e^{\lambda(t - \tau)}$ where the real part of $\lambda$ is negative unless $\lambda = 0$.



ANSWERS TO PROBLEMS 


CHAPTER 1 


1. (a) 3/5; (b) 3/5; (c) 3/10. 

2. The space contains the two points HH and TT with probability 1/4; the two
points HTT and THH with probability 1/8; and generally two points with probability
$2^{-n}$ when $n \ge 2$. These probabilities add to 1, so that there is no necessity
to consider the possibility of an unending sequence of tosses. The required probabilities
are 15/16 and 2/3, respectively.

3. $\Pr\{AB\} = 1/6$, $\Pr\{A \cup B\} = 23/36$, $\Pr\{AB'\} = 1/3$.

6. $x = 0$ in the events (a), (b), and (g).
$x = 1$ in the events (c) and (f).
$x = 2$ in the event (d).
$x = 4$ in the event (e).

9. (a) $A$; (b) $AB$; (c) $B \cup AC$.

10. Correct are (c), (d), (e), (f), (h), (i), (k), (l). The statement (a) is meaningless
unless $C \subset B$. It is in general false even in this case, but is correct in the special
case $C \subset B$, $AC = 0$. The statement (b) is correct if $C \supset AB$. The statement
(g) should read $(A \cup B) - A = A'B$. Finally (k) is the correct version of (j).

11. (a) $AB'C'$; (b) $ABC'$; (c) $ABC$; (d) $A \cup B \cup C$; (e) $AB \cup AC \cup BC$;

(f) $AB'C' \cup A'BC' \cup A'B'C$;

(g) $ABC' \cup AB'C \cup A'BC = (AB \cup AC \cup BC) - ABC$; (h) $A'B'C'$; (i) $(ABC)'$.

12. $A \cup B \cup C = A \cup (B - AB) \cup \{C - C(A \cup B)\} = A \cup BA' \cup CA'B'$.


CHAPTER 2 

1. (a) $26^3$; (b) $26^2 + 26^3 = 18{,}252$; (c) $26^2 + 26^3 + 26^4$. In a city with 20,000
inhabitants either some people have the same set of initials or at least 1748 people
have more than three initials.

2. $64 \cdot 14 = 896$. For a chess board with $n^2$ fields the formula is $n^2(2n - 2)$.

3. $2(2^{10} - 1) = 2046$.

4. $n(n + 1)/2$.

5. (a) $1/n$; (b) $1/\{n(n - 1)\}$.

6. $p_1 = 0.01$, $p_2 = 0.27$, $p_3 = 0.72$.

7. $p_1 = 0.001$, $p_2 = 0.063$, $p_3 = 0.432$, $p_4 = 0.504$.

8. $p_r = (10)_r 10^{-r}$. For example, $p_3 = 0.72$, $p_{10} = 0.00036288$. Stirling’s
approximation (7.7) gives $p_{10} = 0.0003598\ldots$.



9. (a) $(9/10)^k$; (b) $(9/10)^k$; (c) $(8/10)^k$; (d) $2(9/10)^k - (8/10)^k$; (e) $AB$ and $A \cup B$.

(ate© ©'• 

11. (a) $1/\{1 \cdot 3 \cdot 5 \cdots (2n - 1)\} = 2^n n!/(2n)!$; (b) $n!/\{1 \cdot 3 \cdots (2n - 1)\} = 2^n / \binom{2n}{n}$.


12. On the assumption of randomness the probability that all of 12 tickets come
either on Tuesdays or Thursdays is $(2/7)^{12} = 0.0000003\ldots$. There are only
$\binom{7}{2} = 21$ pairs of days, so that the probability remains extremely small even for
any two days. Hence it is reasonable to assume that the police have a system.

13. Assuming randomness, the probability of the event is $(6/7)^{12} = 1/6$ appr.
No safe conclusion is possible.


14. (a) $(100 - r)_n / (100)_n$; (b) $(1 - r/100)^n$.

For $n = r = 3$ the probabilities are (a) 0.911812 ... and (b) 0.912673 .... For
$n = r = 10$ they are (a) 0.330476 ... and (b) 0.348678 ....

15. Cf. problem 14 with n = r = 10. 

16. $25!\,(5!)^{-5}\,5^{-25} = 0.00209\ldots$.

17. $\frac{2(n - 2)_r (n - r - 1)!}{n!} = \frac{2(n - r - 1)}{n(n - 1)}.$

18. (a) 1/216; (b) 83/3888.

19. The probabilities are $1 - (5/6)^4 = 0.517747\ldots$ and $1 - (35/36)^{24} = 0.491404\ldots$.

20. $(5/6)^x < 1/3$ or $x \ge 7$.

21. $12!/12^{12} = 0.000054$.

22. $\binom{12}{2}(2^6 - 2)12^{-6} = 0.00137\ldots$.
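The numerical values quoted for answers 19 to 22 can be reproduced in a few lines. This is only an arithmetic check; the expressions are the ones suggested by the printed decimals (in particular the binomial factor $\binom{12}{2}$ in answer 22 is inferred from the value 0.00137), and the variable names are hypothetical.

```python
from math import comb, factorial

# Answer 19: at least one six in 4 throws of a die; at least one double six in 24 throws of two dice.
p_one_six = 1 - (5 / 6) ** 4
p_double_six = 1 - (35 / 36) ** 24

# Answer 20: smallest x with (5/6)**x < 1/3.
x_min = next(n for n in range(1, 100) if (5 / 6) ** n < 1 / 3)

# Answers 21 and 22.
p_21 = factorial(12) / 12 ** 12
p_22 = comb(12, 2) * (2 ** 6 - 2) / 12 ** 6

print(p_one_six, p_double_six, x_min, p_21, p_22)
```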





29. p 


Q(u-.)OQ 0 (.3- d 

O 


QOO 

30. Cf. problem 29. The probability is

$\binom{13}{m}\binom{39}{13 - m}\binom{13 - m}{n}\binom{26 + m}{13 - n} \div \binom{52}{13}\binom{39}{13}.$

32. $p(a, b, c, d) = \binom{13}{a}\binom{39}{13 - a}\binom{13 - a}{b}\binom{26 + a}{13 - b}\binom{13 - a - b}{c}\binom{13 + a + b}{13 - c} \div \binom{52}{13}\binom{39}{13}\binom{26}{13}.$

33. (a) $24p(5, 4, 3, 1)$; (b) $4p(4, 4, 4, 1)$; (c) $12p(4, 4, 3, 2)$.

aooo 


34 


o 


■. [Cf. problem 33 for the probability that the hand 


contains a cards of some suit, b of another, etc.] 
<48\ /52 


= 0.010 564 .... 

= .164 802.... 

= .134 838.... 


-.o*® 

**- 00*00 

„8,1,1,0)-4.3.12QQQ 

♦©©© 

*>• '•>•»-« O © O * Q O © 

<«> 0 C; *) c © r* + *) + © ("; 0=* * 


.584 298 .... 


37. Cf. problem 31.





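38. Let $q = 1/\binom{52}{5}$. The probabilities are: (a) $4q = 1/649{,}740$; (b) $36q = 3/216{,}580$;
(c) $13 \cdot 12 \cdot 4q = 1/4165$; (d) $13 \cdot 12 \cdot 4 \cdot 6q = 6/4165$; (e) $4\binom{13}{5}q = 33/16{,}660$;
(f) $9 \cdot 4^5 q = 768/216{,}580$; (g) $13\binom{12}{2}4 \cdot 4^2 q = 88/4165$;
(h) $\binom{13}{2} \cdot 11 \cdot 6 \cdot 6 \cdot 4q = 198/4165$; (i) $13\binom{12}{3} \cdot 6 \cdot 4^3 q = 1760/4165$.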
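The fractions in answer 38 can be confirmed with exact rational arithmetic. In the sketch below the counting factors are the ones that reproduce the printed fractions; the way each count is factored is an assumption where the print is illegible, and the dictionary keys simply follow the labels (a) to (i).

```python
from fractions import Fraction
from math import comb

q = Fraction(1, comb(52, 5))   # probability of one particular five-card hand

hands = {
    "a": 4 * q,
    "b": 36 * q,
    "c": 13 * 12 * 4 * q,
    "d": 13 * 12 * 4 * 6 * q,
    "e": 4 * comb(13, 5) * q,
    "f": 9 * 4 ** 5 * q,
    "g": 13 * comb(12, 2) * 4 * 4 ** 2 * q,
    "h": comb(13, 2) * 11 * 6 * 6 * 4 * q,
    "i": 13 * comb(12, 3) * 6 * 4 ** 3 * q,
}

for label, prob in hands.items():
    print(label, prob, float(prob))
```

`Fraction` reduces to lowest terms, so the printed values may look different from the unreduced fractions in the text while being equal to them.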

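39.
Pr{(7)} = $10 \cdot 10^{-7}$ = 0.000 001.
Pr{(6, 1)} = $10 \cdot 9\binom{7}{6}10^{-7}$ = .000 063.
Pr{(5, 2)} = $10 \cdot 9\binom{7}{5}10^{-7}$ = .000 189.
Pr{(5, 1, 1)} = $10\binom{9}{2}\binom{7}{5}2 \cdot 10^{-7}$ = .001 512.
Pr{(4, 3)} = $10 \cdot 9\binom{7}{4}10^{-7}$ = .000 315.
Pr{(4, 2, 1)} = $10 \cdot 9 \cdot 8\binom{7}{4}\binom{3}{2}10^{-7}$ = .007 560.
Pr{(4, 1, 1, 1)} = $10\binom{9}{3}\binom{7}{4}3 \cdot 2 \cdot 10^{-7}$ = .017 640.
Pr{(3, 3, 1)} = $\binom{10}{2}8\binom{7}{3}\binom{4}{3}10^{-7}$ = .005 040.
Pr{(3, 2, 2)} = $10\binom{9}{2}\binom{7}{3}\binom{4}{2}10^{-7}$ = .007 560.
Pr{(3, 2, 1, 1)} = $10 \cdot 9\binom{8}{2}\binom{7}{3}\binom{4}{2}2 \cdot 10^{-7}$ = .105 840.
Pr{(3, 1, 1, 1, 1)} = $10\binom{9}{4}\binom{7}{3}4 \cdot 3 \cdot 2 \cdot 10^{-7}$ = .105 840.
Pr{(2, 2, 2, 1)} = $\binom{10}{3}7\binom{7}{2}\binom{5}{2}\binom{3}{2}10^{-7}$ = .052 920.
Pr{(2, 2, 1, 1, 1)} = $\binom{10}{2}\binom{8}{3}\binom{7}{2}\binom{5}{2}3 \cdot 2 \cdot 10^{-7}$ = .317 520.
Pr{(2, 1, 1, 1, 1, 1)} = $10\binom{9}{5}\binom{7}{2}5 \cdot 4 \cdot 3 \cdot 2 \cdot 10^{-7}$ = .317 520.
Pr{(1, 1, 1, 1, 1, 1, 1)} = $\binom{10}{7}7 \cdot 6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 10^{-7}$ = .060 480.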
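The fifteen probabilities of answer 39 can be recomputed from a general counting rule, which also confirms that they add to one. The sketch assumes the setting suggested by the values: seven random decimal digits, classified by the multiplicities with which the distinct digits occur. For a pattern with parts $k_1 \ge k_2 \ge \cdots$, the number of sequences is (ways to assign digits to the parts) times (ways to arrange them); the function names are hypothetical.

```python
from collections import Counter
from math import factorial

def partitions(n, largest=None):
    """Yield the partitions of n as non-increasing tuples."""
    largest = n if largest is None else largest
    if n == 0:
        yield ()
        return
    for first in range(min(n, largest), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest

def pattern_probability(parts, digits=10):
    """Probability that a random digit sequence has the given multiplicity pattern."""
    length = sum(parts)
    # Assign distinct digits to the parts; equal parts are interchangeable.
    choose_digits = factorial(digits) // factorial(digits - len(parts))
    for mult in Counter(parts).values():
        choose_digits //= factorial(mult)
    # Arrange the chosen digits in the sequence (multinomial coefficient).
    arrangements = factorial(length)
    for p in parts:
        arrangements //= factorial(p)
    return choose_digits * arrangements / digits ** length

table = {parts: pattern_probability(parts) for parts in partitions(7)}
for parts, prob in table.items():
    print(parts, f"{prob:.6f}")
```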

40. pi(r) = ( r f 9 ig ) 

'*(?)■ 



'♦(?)• 


^“(r-So) 

-n- 






 r     p_1(r)      p_2(r)      p_3(r)

52     1           1           1
51     0.75        0.5         0.25
50     .558 82     .245 10     .058 82
49     .413 53     .117 65     .012 94
48     .303 82     .055 22     .002 64
47     .221 53     .025 31     .000 50
46     .160 26     .011 31     .000 08
45     .114 97     .004 92     .000 01
44     .081 76     .002 08
43     .057 60     .000 85
42     .040 19     .000 34
41     .027 75     .000 13
40     .018 95     .000 05
39     .012 79     .000 02
38     .008 53     .000 005
37     .005 61     .000 002


CHAPTER 3 


1 . 

2 . 


r 

c 


+ 


n - 1 
r x 


)( 


r 2 -h n - 
r 2 



*)• 


3 (n + r 2 Hh rg)! 
rifoW 

4. $\alpha_0 = n!\,n^{-n} \sim (2\pi n)^{1/2}e^{-n}$ and $\alpha_1 = n(n - 1)\alpha_0/2$.

5. $\frac{r!}{k_1!\,k_2! \cdots k_n!}\,n^{-r}$.

10. Use lemma 2 of section 1. 

11. For Tk , Pfc use, respectively, formulas (6.6), (9.14), (9.5) of Chapter 2.

12. Select the $\nu$ alphas and the single beta which must follow it. The remaining
elements can be arranged in $(r_1 - \nu + r_2 - 1)!$ ways.

13. Follows from problem 12 for $\nu = 0$.

14. Consider Piy/P^-i. 

15. The hint and the argument used to prove the theorem show that 




Use formula (6.5) of Chapter 2 twice. 

16. The alpha runs can be put in arbitrary order between, preceding, or follow- 

( r 2 “H 1\ 

J selections for the places of the alpha 

runs. The first factor gives the number of ways in which the k places can be assigned 
to runs of different lengths. 





CHAPTER 4 


1. 99/323. 

2 . 0.21 .... 

3. 1/4. 

4. 7/2®. 

5. 1/81 and 31/6®. 

6. If Ak is the event that (k, k) does not appear, then from (1.5) 



7. u r 




9. 


Vr = 


N 

E 


fc-0 



(ft - fc)r 

Wr 


10. The general term is $\pm a_{1k_1}a_{2k_2} \cdots a_{Nk_N}$, where $(k_1, k_2, \ldots, k_N)$ is a permutation
of $(1, 2, \ldots, N)$. For a diagonal element $k_\nu = \nu$.

12. Note that, by definition, $u_r = 0$ for $r < n$ and $u_n = n!\,s^n/(ns)_n$.


14. U r — W r _ 1 


E (-D *- 1 

k~l 


( n — 1\ (ns — A-.s) r _i 
A; — 1/ (ns — l)r —i 




n—1 

£ (-U*l 

k=0 

(“:')(■ 

15.
 r     Q_1(r)      Q_2(r)      Q_3(r)

51     0           0           1
50     0           0.76471     0.23529
49     0.39765     .55059      .05177
48     .58430      .29964      .01056
47     .58836      .14592      .00198
46     .50634      .06684      .00034
45     .40102      .02935      .00005
44     .30214      .01243      .00001
43     .22021      .00509
42     .15672      .00201
41     .10946      .00077
40     .07524      .00028
39     .05097      .00010

“•(T Opj-^OO'- 


17. Use ( 5 g)-S* - 

ocn 

• 



$P_{[0]} = 0.264$, $P_{[1]} = 0.588$, $P_{[2]} = 0.146$, $P_{[3]} = 0.002$, approximately.








$P_{[0]} = 0.780217$, $P_{[1]} = 0.204606$, $P_{[2]} = 0.014845$,
$P_{[3]} = 0.000330$, $P_{[4]} = 0.000002$, approximately.

19. $m!\,N!\,u_m = \sum_{k=0}^{N-m} (-1)^k (N - m - k)!/k!$

20. Cf. the following formula with r = 2. 


21. (rN)lx - ) r\rN - 2)! - ^ ) r\rN - 3)! +- 

ve liai 

o 


+ (-1 ) K r N (rN - AT)!. 
23. For n > r we have Pi = 1. In (7.1) let k — n — v — 1. 


26. P[ m ] = 




s <-■>*(” nc 


n — wi + r — 1 — k 




27. Use (0.11) and (9.1) of Chapter 2. 


CHAPTER 5


1 . 1 


(5)3 

(0)3 


2. p — \ — 

3. 0.41. . .. 


_ 1 
“ 2 * 

10 - 5 9 
^uTZ'llo 


0.61 .... 



6 125 140 80 
b * 345 ; 345'345* 

7. 1 - p\ 


10 . 


2 


11. p 



Pl)(l - P2> • • • (1 - ?»)• 





12. Use $1 - x < e^{-x}$ for $0 < x < 1$ or Taylor’s series for $\log (1 - x)$; cf. (6.9) of
Chapter 2.


5 + 

b +c + 


16. If the statement is true for the $n$th drawing regardless of $b$, $r$, and $c$, then the
probability of red at the $(n + 1)$st trial is

$\frac{b}{b + r} \cdot \frac{b + c}{b + r + c} + \frac{r}{b + r} \cdot \frac{b}{b + r + c} = \frac{b}{b + r}.$

17. The preceding problem states that the assertion is true for $m = 1$ and all $n$.
For induction, consider the two possibilities at the first trial.

20. From (5.2) $2v = 2p(1 - p) \le 1/2$.

22. (a) $u^2$; (b) $u^2 + uv + v^2/4$; (c) $u^2 + (25uv + 9v^2 + vw + 2uw)/16$.

27. $p_{11} = p_{32} = 2p_{21} = p$, $p_{12} = p_{33} = 2p_{23} = q$, $p_{13} = p_{31} = 0$, $p_{22} = 1/2$.


CHAPTER 6 

1. 5/16. 

2. The probability is 0.02804....

3. $(0.9)^x < 0.1$, $x \ge 22$.

4. $q^x < 1/2$ and $(1 - 4p)^x < 1/2$ with $p = \binom{48}{9} \div \binom{52}{13}$. Hence $x \ge 263$ and
$x \ge 66$, respectively.

5. $1 - (0.8)^{10} - 2(0.8)^9 = 0.6242\ldots$.

6. $\{1 - (0.8)^{10} - 2(0.8)^9\}/\{1 - (0.8)^{10}\} = 0.6993$.

'• (DO

8. (6-®-212-«|.

9. True values: 0.6651 ..., 0.40187 ..., and 0.2009 ...; Poisson approximations:
$1 - e^{-1} = 0.6321\ldots$, 0.3679 ..., and 0.1839 ....

10. $e^{-2}\sum_{k=4}^{\infty} 2^k/k! = 0.143\ldots$.

11. $e^{-1}\sum_{k=3}^{\infty} 1/k! = 0.080\ldots$.

12. $e^{-x/100} < 0.05$ or $x \ge 300$.

13. $e^{-1} = 0.3679\ldots$, $1 - 2e^{-1} = 0.264\ldots$.

14. $e^{-x} < 0.01$, $x \ge 5$.

15. 1/p = 649,740. 
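Several of the binomial and Poisson values in this chapter's answers are quick to recompute. The sketch below is an arithmetic check only; it assumes the readings suggested by the printed decimals (ten Bernoulli trials with $p = 0.2$ in answers 5 and 6, Poisson tails with parameters 2 and 1 in answers 10 and 11), and the variable names are hypothetical.

```python
from math import exp, factorial

# Answers 5 and 6: at least two successes in 10 trials with p = 0.2,
# unconditionally and given at least one success.
p5 = 1 - 0.8 ** 10 - 2 * 0.8 ** 9
p6 = p5 / (1 - 0.8 ** 10)

# Answers 10 and 11: Poisson tail probabilities, computed through the complement.
p10 = 1 - exp(-2) * sum(2 ** k / factorial(k) for k in range(4))
p11 = 1 - exp(-1) * sum(1 / factorial(k) for k in range(3))

# Answer 14: smallest integer x with exp(-x) < 0.01.
x14 = next(x for x in range(1, 50) if exp(-x) < 0.01)

print(p5, p6, p10, p11, x14)
```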

1®* 2^ 2” 2n “ ( n ) 2~ 2n « for ^ ar K e 71 [cf- (9-® of Chapter 2]. 



This can be written in the alternative form
$p^a \sum_{k=0}^{b-1} \binom{a + k - 1}{k} q^k$, where the $k$th term equals the probability that the $a$th
success occurs directly after $k \le b - 1$ failures.

19. The successive terms decrease faster than a geometric sequence with ratio 
(n — k)p/kq when k > np and kq/(n — k)p when k < np. 

20. Use 1 — x < e~ x for 0 < x < 1. 

21. Use the obvious symmetry between successes and failures. 

30. $p = p_1 q_2/(p_1 q_2 + p_2 q_1)$.


CHAPTER 7 


1. Proceed as in section 1. 

2. Use (1.7). 

3. Write the integral in the form 

e~* i/2 f'\e TV + e~ xv )e~ vt,i dy. 
JQ 


4. 0.99. 

5. 500. 

6. 66,400. 

7. Most certainly. The inequalities of Chapter 6 suffice to show that an excess 
of more than 8 standard deviations is exceedingly improbable. 

8. $(2\pi n)^{-1}\{p_1 p_2 (1 - p_1 - p_2)\}^{-1/2}$.

CHAPTER 8 


1 . 0 - 21 . 

2. x — pu + QV + rw, where u , v, w are solutions of 


u — p a 1 


1 - p a ~ l 

+ (qv + rw) — -, 

1 - P 


V — (pu -f TW) 



3. u 


v 


w 


w = pu -f* qv + rw = x* 


p «~ 1 


1 - p ®- 1 

-f (qv + rw) -- 

1 - p 


(pu + rw) 


(pu + qv) 


1 - q 
1 - rT~ l 


4. Note that $\Pr\{A_n\} < (2p)^n$, but

$\Pr\{A_n\} > 1 - (1 - p^n)^{2^n/2n} > 1 - e^{-(2p)^n/2n}.$

If $p = 1/2$, the last quantity is $\sim 1/2n$; if $p > 1/2$, then $\Pr\{A_n\}$ does not even tend
to zero.




CHAPTER 9 

1. In the joint distribution of X, Y the rows are $32^{-1}$ times (1, 0, 0, 0, 0, 0),
(0, 5, 4, 3, 2, 1), (0, 0, 6, 6, 3, 0), (0, 0, 0, 1, 0, 0); of X, Z: (1, 0, 0, 0, 0, 0), (0, 5, 6,
1, 0, 0), (0, 0, 4, 6, 1, 0), (0, 0, 0, 3, 2, 0), (0, 0, 0, 0, 2, 0), (0, 0, 0, 0, 0, 1); of Y, Z:
(1, 0, 0, 0), (0, 5, 6, 1), (0, 4, 7, 0), (0, 3, 2, 0), (0, 2, 0, 0), (0, 1, 0, 0). Distribution
of X + Y: (1, 0, 5, 4, 9, 8, 5) all divided by 32, and the values of X + Y ranging
from 0 to 6; of XY: (1, 5, 4, 3, 8, 1, 6, 0, 3, 1) all divided by 32, the values ranging
from 0 to 9. E(X) = 5/2, E(Y) = 3/2, E(Z) = 31/16, Var(X) = 5/4, Var(Y)
= 3/8, Var(Z) = 303/256.

2. $364^n \cdot 365^{1-n}$.

3. (a) $365\{1 - 364^n \cdot 365^{-n} - n\,364^{n-1} \cdot 365^{-n}\}$; (b) $n \ge 28$.

4. (a) $\mu = n$, $\sigma^2 = (n - 1)n$; (b) $\mu = (n + 1)/2$, $\sigma^2 = (n^2 - 1)/12$.

5. $-n/36$.

» 2 nNt 

6. a 1 «---- 

(n + l) 2 (n + 2) 

7. Pr\X <r,Y>s) = ( — - 1±* )"; 

Pr[X - r, Y = s) = jV~ n ((r - s + 1)" - 2(r - »)" + (r - « - 1)"| 

T n 2 - (r - 1)"~ 2 . 

8. x =---r———— if j < r, k < r. 


r n - (r - l) n 
r B ~ 2 

X ~ r” - (r - l) n 
2=0 


if j < r 9 k =* r, or j = r, k < r. 


if j > r or A; > r. 


9. $p_k = p^k q + q^k p$; $E(X) = pq^{-1} + qp^{-1}$; $\mathrm{Var}(X) = pq^{-2} + qp^{-2} - 2$.

10. $q_k = p^2 q^{k-1} + q^2 p^{k-1}$; $\Pr\{X = m, Y = n\} = p^{m+1}q^n + q^{m+1}p^n$ with $m, n \ge 1$;
$E(Y) = 2$; $\sigma^2 = 2(pq^{-1} + qp^{-1} - 1)$.

11. The distribution is given in Chapter 2, (5.8); the means and variances in 


rn i 

example (5.c). We have E(X) = — ; Var(Z) = 

n 

—rain 2 (w — r) 


rni(7i — ni)(n — r) 


'A 


n 2 (n — 1) 


; Cov(Z, Y) 


.=(.-1) it<J ^--( (n-„0h- J ' 

fin(n - i)(rj 4- l) 


12. £(*) = ri(ri , + 1) ; Var(X) 


13. J?(S„) 


ri + r 2 
nb 


b -h r 


; Var(S n ) 


(ri H- r 2 — l)(n + r 2 ) 2 

n6r{6 + r -f nc} 

(fr + r) 1 (6 + r+ c)’ 


14. £(r,) - £ 


at 


Jfe~i r — A: + 1 


. Yartr) -^ ^cy-r + fc-i) 
’ (r) « (r - * + D 2 





15 ' E (a r + l) ” ^ kpk ' n ^ n + !) “ flVff' (i - ^Ti) + 99')" -1 


m 99' _ gV9' ■ 1 

1 — gp' (1 - gp') 2 ° 5 6 * 8 9P' 

E(AO = ^ ; fi’(N) = —; Cov(A, N) 
p qp 


P(K, N) = 


(1 


X_l 

- 9P) I 


M 



18. (o) 1 - g*; ( 6 ) E{X) - A (l - g* + ij ; ( c ) = 0. 


'& < ~ 1) *“'r=i(s)* + (?W 


To derive the last formula from the first, put/(g) = rZfc -" 1 ^ 9 *. Using 

(9.1) of Chapter 2 , one finds that/'(g) = rq r ~ I (1 — g)~ r . The assertion now fol¬ 
lows by repeated integrations by part. 


CHAPTER 11 

1. $sP(s)$ and $P(s^2)$.

2. (a) $(1 - s)^{-1}P(s)$; (b) $(1 - s)^{-1}sP(s)$; (c) $\{1 - sP(s)\}/(1 - s)$; (d) $p_0 s^{-1}
+ \{1 - s^{-1}P(s)\}/(1 - s)$; (e) $\tfrac{1}{2}\{P(s^{1/2}) + P(-s^{1/2})\}$.

3. $U(s) = pqs^2/\{(1 - ps)(1 - qs)\}$. Mean $= 1/pq$, Var $= (1 - 3pq)/p^2q^2$.

4. $U(s) = \frac{1}{2(1 - s)} + \frac{1}{2\{1 - (q - p)s\}}$, $2u_n = 1 + (q - p)^n$.

5. Using the generating function for the geometric distribution of X p we have 
without computation 


Pr(s) = 3 r 


/N - l\ / N -2\ / N - r + 1 \ 

\N - s) \N - 2 a) ’ ‘ ‘ VST- (r - l)s/ 


6 . From (9.1) P r (s)\N - (r - l)s) - P r -i{s)(N - r - 1 ) 5 . 

8 2s rs 


7 . Pr(s) 


N -(N - 1)8 N — (N — 2)3 


N - (N - r)s 


8 . S r is the sum of r independent variables with a common geometric distribu¬ 
tion. Hence 






9. Pr\R - r} - £ Pr(S r _i = A;} Pr[X r > v - k\ 


k -o 

9-1 


grvf + :-v-^c:i I 2 )- 


B(fi) = l+^, Var(R)-4 
P P 


« /jt _ n 

14. ?/,* - g n + 22 ^ 2 j p 3 g*~" 3 u n -* with ^ = l, wi * g, ^2 - g 2 , 


W3 


p 8 + 5 s . Using the fact that this recurrence relation is of the convolution type, 
U(s) 


1 I (p * )3 T70r) 

+ t;- U{8). 


1 — qs (1 — g«) 3 


15. Un = pt^n-1 + gWn-1, t>n * P^n-l + g*>n-l, Wn = P*>n-1 + V^n-l. Hence 
t/(s) - 1 * psW(s) + qsU(s); V(s) - psU(s) + gs-VM; W(s) - psFM + gsW(s). 

20. From (6.2), $P_{n+1}''(1) = P''(1)P_n'^2(1) + P'(1)P_n''(1)$. Putting $\sigma^2 = \mathrm{Var}(X_1)$,
this equation becomes $\mathrm{Var}(X_{n+1}) = \mu^{2n}\sigma^2 + \mu\,\mathrm{Var}(X_n)$ and hence $\mathrm{Var}(X_n) =
\sigma^2(\mu^{2n-2} + \mu^{2n-3} + \cdots + \mu^{n-1})$.

If $\mu = 1$, then $\mathrm{Var}(X_n) = n\sigma^2$; otherwise $\mathrm{Var}(X_n) = \sigma^2\mu^{n-1}(\mu^n - 1)/(\mu - 1)$.


CHAPTER 12 

1. It suffices to show that for all roots $s \ne 1$ of $F(s) = 1$ we have $|s| \ge 1$, and
that $|s| = 1$ is possible only in the periodic case.

2. $u_{2n} = \left\{\binom{2n}{n}2^{-2n}\right\}^r \sim (\pi n)^{-r/2}$. Hence $\mathcal{E}$ is certain only for $r \le 2$. For

$r = 3$ the tangent rule for numerical integration gives



Hence by (3.7) the probability of $\mathcal{E}$ ever occurring is, approximately, $x = 1/3$. A
more precise evaluation of the sum is 0.47 and leads to $x = 0.32$.

3. The zero preceding the first negative value of S n may be the first, second, 
etc., zero. Hence the required generating function is 


5 {i+5*»+**■»+• 


PM 


PM # 

8 


6. Note that $1 - F(s) = (1 - s)Q(s)$ and $\mu - Q(s) = (1 - s)R(s)$, whence
$Q(1) = \mu$, $2R(1) = \sigma^2 - \mu + \mu^2$. The power series for $Q^{-1}(s) = \sum (u_n - u_{n-1})s^n$
converges for $s = 1$.

CHAPTER 13 


1. N n * « (Nn - 714.3)/22.75; *($ - *(-f) « 

4. If $\alpha_n$ is the probability that an A-run of length $r$ occurs at the $n$th trial, then
$A(s)$ is given by (1.5) with $p$ replaced by $\alpha$ and $q$ by $1 - \alpha$. Let $B(s)$ and $C(s)$ be
the corresponding functions for B- and C-runs. The required generating functions
are $F(s) = 1 - U^{-1}(s)$, where in case (a) $U(s) = A(s)$; in (b) $U(s) = A(s) + B(s)
- 1$; in (c) $U(s) = A(s) + B(s) + C(s) - 2$.

5. Use a straightforward combination of the method in example (2.6) and 
problem 4. 

9. u n = Np> Vk(oo) = JVpg* 


CHAPTER 15

1. (a) The chain is irreducible and ergodic; $p_{jk}^{(n)} \to 1/3$ for all $j$, $k$. (Note
that $P$ is doubly stochastic.) (b) The chain has period 3, with $G_1$ containing $E_1$ and
$E_2$; the state $E_4$ forms $G_2$, and $E_3$ forms $G_3$. We have $u_1 = u_2 = 1/2$, $u_3 = u_4 = 1$.
(c) The states $E_1$ and $E_3$ form a closed set $S_1$, and $E_4$, $E_5$ another closed set $S_2$,
while $E_2$ is transient. The matrices corresponding to the closed sets are two by
two matrices with elements 1/2. Hence $p_{jk}^{(n)} \to 1/2$ if $E_j$ and $E_k$ belong to the
same $S_r$; $p_{j2}^{(n)} \to 0$; finally $p_{2k}^{(n)} \to 1/2$ if $k = 1, 3$, and $p_{2k}^{(n)} \to 0$ if $k = 2, 4, 5$.

2. $p_{jj}^{(n)} = (j/6)^n$, $p_{jk}^{(n)} = (k/6)^n - ((k - 1)/6)^n$ if $k > j$, and $p_{jk}^{(n)} = 0$ if
$k < j$.

3. $x_k = (3/4, 1/2, 1/4, 1/2)$, $y_k = (1/4, 1/2, 3/4, 1/2)$.

4. $p_{jj} = 2j(N - j)/N^2$, $p_{j,j+1} = (N - j)^2/N^2$, $p_{j,j-1} = j^2/N^2$, $u_k = \binom{N}{k}^2 \div \binom{2N}{N}$.

10. Note that the matrix is doubly stochastic and use example (6.d).

15. Let $M$ be the maximum of $x_j$. Consider the states $E_r$ for which $x_r = M$.

18. If $n \ge m + 2$, the variables $X^{(m)}$ and $X^{(n)}$ are independent, and hence the
three rows of the matrix $p_{jk}^{(m,n)}$ are identical with the distribution of $X^{(n)}$, namely,
(1/4, 1/2, 1/4). For $n = m + 1$ the three rows are (1/2, 1/2, 0), (1/4, 1/2, 1/4),
(0, 1/2, 1/2).


CHAPTER 17 

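3. $E(X) = ie^{\lambda t}$; $\mathrm{Var}(X) = ie^{\lambda t}(e^{\lambda t} - 1)$.

4. $P_n' = -\lambda nP_n + \lambda(n + 1)P_{n+1}$.

$P_n(t) = \binom{i}{n} e^{-i\lambda t}(e^{\lambda t} - 1)^{i-n} \quad (n \le i).$

$E(X) = ie^{-\lambda t}$; $\mathrm{Var}(X) = ie^{-\lambda t}(1 - e^{-\lambda t})$.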
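The answer to problem 4 can be spot-checked numerically. The sketch below assumes the binomial form $P_n(t) = \binom{i}{n}e^{-i\lambda t}(e^{\lambda t} - 1)^{i-n}$ for $n \le i$ and checks normalization, the mean $ie^{-\lambda t}$, the variance $ie^{-\lambda t}(1 - e^{-\lambda t})$, and the differential equation $P_n' = -\lambda nP_n + \lambda(n + 1)P_{n+1}$ by a central difference. The parameter values are hypothetical.

```python
from math import comb, exp

def pure_death(n, t, lam, i):
    """P_n(t) for the pure death process started at state i (zero outside 0..i)."""
    if n < 0 or n > i:
        return 0.0
    return comb(i, n) * exp(-i * lam * t) * (exp(lam * t) - 1) ** (i - n)

lam, t, i = 0.7, 1.3, 6
probs = [pure_death(n, t, lam, i) for n in range(i + 1)]
total = sum(probs)
mean = sum(n * p for n, p in enumerate(probs))
var = sum(n * n * p for n, p in enumerate(probs)) - mean ** 2

# Residual of the differential equation at one state, by central difference.
n, h = 3, 1e-5
lhs = (pure_death(n, t + h, lam, i) - pure_death(n, t - h, lam, i)) / (2 * h)
rhs = -lam * n * pure_death(n, t, lam, i) + lam * (n + 1) * pure_death(n + 1, t, lam, i)
print(total, mean, var, lhs - rhs)
```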

14. The standard method of solution leads to a system of linear equations. Use 
the hint to problem 15 of Chapter 15. 




INDEX 


Absorption: birth and death process 373; 
diffusion 296, 304; Markov chains 
332, 360, 373; random walk 279 (sev¬ 
eral dimensions 299); sequential sam¬ 
pling 281, 300. 

Acceptance 281, 300, 314. 

Accidents 117, 189, 234, 365 (distr. of 
damage 222). 

Adler, H. A., and K. W. Miller 384. 

Age distribution 276 (stable 278). 

Ages of a couple 12, 14. 

Aggregates, self-renewing 275. 

Andersen, E. Sparre 252. 

Approximation: binomial distr. by normal
(individual terms 135, 146; central
part 137, tails 144, 147), binomial
distr. by Poisson 110, 125 (error
estimate 115); birthday distr. 29, 72;
Bonferoni’s inequalities 75, 101;
hypergeometric distr. (by binomial 47,
108, 125, by Poisson 114, by normal
146); multinomial by Poisson 127; n!
43, 50; normal distr. (tails) 131, 145;
by partial fraction expansions 229, 237
(numerical examples 230, 267, 276;
error estimate 269); sequential sampling
(= generalized random walk)
303. [Cf. Limit Theorems.]

Arc sine law 252, 257; counterpart 
262. 

Assignable causes 56. 

Atomic bombs 223. 

Average of distr. 172. 

Averages, moving 339, 340, 346. 

b(k; n, p) 106. 

Bachelier, L. 293. 

Backward equations 385, 390, 392. 

Bacteria counts 122. 

Banach’s match-box problem 108, 176.

Barriers 279, 280 (in several dimensions 
299, 345). 


Bartky, W. 281, 314. 

Bayes’s rule 85.

Bernoulli, D. 199. 

Bernoulli, J. 104. 

Bernoulli trials 104; (billiards 235,
236); gambling systems 151; infinite
sequences 149, 282; iterated logarithm
157, 163; multiple 125, 127, 168;
number theoretical interpretation 161;
return to equilibrium 244, 305; (even
number of successes 235). [Cf. Binomial
distribution, Coin tossing, Random
walk, Recurrent events, Ruin,
Runs.]

Bernstein, S. 88, 140. 

Beta function 127. 

Betting, on runs 149, 268, 278; — systems
151, 282. [Cf. Coin tossing,
Duration of games, Games, Ruin.]

Billiards 235, 236. 

Bingo 47, 76. 

Binomial coefficients 30, 40; identities 
with 47, 76; integrals for 292, 305. 

Binomial distribution 106; as beta function
127; central term 109; combination
with Poisson 128, 221, 344;
convolutions 126, 216; expectation 173,
216 (absolute 189); generating fct.
216; — and hypergeometric 47, 108,
125; negative — 218; normal approximation
(individual terms 135, 146;
central part 137, 139; tails 144, 147);
in number theory 161; in occupancy
problems 55, 69; Poisson approximation
110, 125 (error estimate 115;
comparison with normal 143); tails
126, 144; variance 178, 180, 216. [Cf.
Bernoulli trials, Random walk, Ruin.]

Binomial formula 41. 

Biological applications , cf. Birth and 
death process , Breeding , Chromosomes , 
Genes, Larvae , Renewal, Survival . 




Birth process 367, 394, 396 (divergent 
369, 391). 

Birth and death process 371, 395.

Birthdays 29; expectations 174, 187; as 
occupancy problem 52; Poisson distr. 
72, 112; special problems 45, 125. 

Bishop, D. J. 276. 

Blood: counts 122; tests 189. 

Boltzmann-Maxwell statistics 53. 

Bomb hits 120. 

Bonferoni’s inequalities 75, 101. 

Boole’s inequality 20. 

Borel, E. 157, 163; Borel-Cantelli lemmas
154, 159, 160, 209.

Bose-Einstein statistics 53, 59, 77. 

Bottema, O. 235. 

Branching process, cf. Chain reactions. 

Breeding 101; (as Markov chain problem
315, 345, 360).

Bridge 9; aces (among r cards 35, 46;
joint distr. 166); algebra of events 15,
21; bingo 47, 76; composition of hands
31, 33, 39, 46, 62, 63, 65, 125, 187;
conditional prob. 79, 100. [Cf.
Shuffling.]

Brother-sister mating 101; (as Markov 
chain 315, 345, 360). 

Brownian motion 279, 293; Ehrenfest 
model 312, 327, 345. 

Busy hour 365. 

Campbell, N. R. 276. 

Cantelli, F. P. 157 (Borel-Cantelli 
lemmas 154, 159, 160, 209). 

Cantor, G. 17, 233. 

Cards, cf. Bridge, Matching, Poker,
Shuffling.

Catcheside, D. G. 45, 76, 120, 222. 

Census, Bureau of 188. 

Centenarians 113. 

Central force in diffusion 313. 

Central limit theorem: for arbitrary distr.
202, 209; binomial distr. 140; identical
distr. 192; Markov processes 342;
recurrent events 248; runs 266; (infinite
moment analogue 253; frequency of
decimals 162; permutations 205).

Chandrasekhar, S. 345. 

Chain reactions 223, 237. 

Chains (polymer molecules) 190. 


Chapman-Kolmogorov equations for
Markov chains 338, 341; for stochastic
processes 387, 396.

Characteristic equation 302. 

Characteristic values 350. 

Chebyshev, P. L. 183; — inequality 
183; generalized 189. 

Chess problems 44, 76. 

Chi-square test: mentioned in connection 
with tabular material, but not defined. 

Chromosomes 92, 96; breakages and
interchanges 45, 76, 120, 128, 222.

Chung, K. L. 189, 252. 

Clarke, R. D. 120. 

Classification , multiple 24. 

Closed set (in Markov chains) 318. 

Cochran, W. G. 57. 

Coin tossing: arc sine law 252, 257 
(counterpart 262); distr. of leads 250, 
255; return to equilibrium 238, 245 
(limit theorem 253, 258; random walk 
288, 304); ties in multiple — 246, 261. 

Collector's problems 52, 64, 76, 174;
moments 181, 188.

Colorblindness 96, 98, 100, 126. 

Combinatorial product space 91. 

Competition problem 139. 

Complementary event 13. 

Composite Markov process (shuffling) 
340. 

Compound distributions 221, 237;
binomial and geometric 223; binomial and
Poisson 128, 221, 344.

Compound experiments 81. 

Compound Poisson distr. and process 
237, 391, 395. 

Confidence level 142. 

Contagion 56, 83, 128, 223. 

Continuity theorem 232, 278. 

Convergence (almost everywhere and in 
measure) 162, 207. 

Convolutions 215, 236; (binomial distr. 
126; Poisson 127). 

Correlation coefficient 186. 

Cosmic rays 369. 

Counters (waiting lines) 378. 

Coupons , collecting of, 52, 64, 76, 174; 
moments 181, 188. 

Covariance 179. 

Cramer, H. 119. 





Craps 16. 

Cumulative chance effects 251. 

Cumulative distribution 133. 

Cycles 205. 

Cyclical random walk 311, 352, 353. 

Cylindrical sets 91. 

Dahlberg, G. 100. 

Death process 394. 

Decimals, distribution of 161; (e, 29,
124; π, 124). [Cf. Random digits.]

Defectives 45, 100, 112; blood tests 
189; (Bartky’s sampling scheme 315; 
Dodge’s 168, 188). 

Degenerate processes 369, 391. 

DeMoivre, A. 133, 212, 236.

DeMoivre-Laplace limit theorem 137 
(traditional form 139). 

Density fluctuations (Ehrenfest model 
327; particles in space 345). 

Density function 133. 

Derivatives, number of 52.

Descendants: in breeding 315, 360; in 
chain reactions 224; family relations 
102; genetics 92, 101, 315; renewal 
275. 

Determinants 76. 

Dice, ace runs 150, 163, 266; distr. of
scores 167, 178 (generating fct. 236);
equalization of ones, twos, ..., 239, 247;
normal distr. 146, 192; as occupancy
problem 51; special problems 33, 45, 76,
100, 106, 124, 125, 187, 344; (dice
illustrating compound experiments 83;
pairwise independence 87).

Difference, nth, of zero 77.

Difference equations: method of particular
solutions 283, 288, 302; passage to
limit of differential eqns. 294, 304; 
several dimensions 299, 306; (for 
Ehrenfest model 327, 345; Polya’s urn 
scheme 101, 395; reflecting barriers 
326). [Cf. Renewal equation.] 

Differential equations, ordinary; backward
and forward 385, 389, 392; generating
functions 395; Kolmogorov’s — 386;
recursive, without uniqueness 393;
special (birth process 367, 390, 394;
birth and death 372, 395; compound
Poisson 391; Poisson 366, 386; Polya 395;
power supply 384, 396; radioactive
process 368; servicing 380; trunking 377,
396; waiting lines 378; Yule process 368,
394).

Diffusion 279, 293; absorption and first 
passage 296, 304; — coefficients 295; 
Ehrenfest model 313, 327, 345. 

Dirac-Fermi statistics 53. 

Discrete sample spaces 16. 

Disorder and chance fluctuations 189. 

Dispersion 178. 

Distinguishable objects 11, 51. 

Distribution: function 133, 165; normal 
129; probability distr. 165 (joint, 
marginal 166). 

Dodge’s inspection plan 168, 188. 

Doeblin, W. 342, 344.

Dominant gene 92. 

Domino 44. 

Doob, J. L. 152, 344, 392. 

Dorfman, R. 189. 

Double generating functions 255. 

Double sampling 168, 188, 314. 

Doubling system 285. 

Drift: diffusion 295; random walk 279. 

Duration of games: Bernoulli trials 280 
(expectation 286; generating function 
289, 305; explicit expressions 292, 304);
Markov chains 335, 358; sequential 
sampling 303, 306. [Cf. Extinction.] 

e (distr. of decimals) 29, 124. 

Eggenberger, F. 83. 

Ehrenfest, P. and T. 313; — model of 
diffusion 312, 345 (stationary distr. 
327). 

Eigenvalue 350. 

Einstein-Bose statistics 53, 59, 77.

Einstein-Wiener diffusion 293. 

Eisenhart, C., and F. S. Swed 56. 

Elastic: barrier 280; — force in diffusion 
313. 

Elevator problem 30, 47, 52. 

Ellis, R. E. 292. 

Equilibrium: coin tossing 238; Ehrenfest
model 327; macroscopic 329;
statistical 373.

Erdős, P. 163, 244, 252.

Ergodic (properties of aperiodic chains
324; periodic 329; stochastic processes
373, 396); — states 321; mean —
theorem 346; non-stochastic matrices
343.

Erlang, A. K. 377. 

Error function 133. 

Estimation, statistical 37.

Events: compatible 60; compound and
simple 9, 13; independent 86 (pairwise
and mutual 88); relations between —
13; simultaneous realization 15 (at
least one 60; m among N 64, 74, 101);
— in repeated trials 90, 91. [Cf.
Recurrent events.]

Evolution 369. 

Expectation 171; — and generating fcts. 
213; infinite 214 (recurrence times 
242); of reciprocals 188, 189, 190. 

Experiments: conceptual 4, 9; compound
81; repeated 89; — and random
variables 164.

Exponential: distribution 220; holding 
times 375. 

Extinction: birth and death process 374, 
395; chain reactions 224; — of genes 
333. 

Extra Sensory Perception (ESP) 45, 336. 

F, for failure 104. 

Factorials 30; Stirling's formula 41, 50. 

Family: relations 102; names,
survival 224; (problems on sex distr. 79,
81, 86, 100, 125, 222).

Favorable cases 20, 23. 

Fermi-Dirac statistics 53. 

Fire accidents 189, 234; (damage distr.
222).

First passages: diffusion 296; Markov 
chains 324; random walk 280 (generating
fct. 290, 305; explicit expression
292); recurrent events 243. 

Fish catches 37. 

Fisher, R. A. 6, 38, 107, 224, 315. 

Fission 224. 

Flaws in material 118. 

Fokker-Planck equations 295, 296, 304. 

Forward equations 385, 389, 392. 

Fréchet, M. 60, 75, 343, 347, 351.

Frequency function 133. 

Friedman, B. 313. 


Frobenius' theory of matrices 343. 

Fry, T. C. 107, 377. 

Furry, W. H. 369. 

Fürth, R. 339; (—’s formula for first
passages 296, 304). 

Galton, F. 204, 224. 

Gambling systems 151, 282. [Cf. Coin
tossing, Ruin.]

Games, fair 196, 284, 287; generalized
— 200; unfavorable 200, 210. [Cf.
Billiards, Duration of games, Ruin.]

Gamma function 49. 

Gaussian distribution 133. 

Generating functions 212; of compound
distr. 223; continuity theorem 232,
278; of differences 236; use in solving
differential equations 395, 396; double
— 255; for first passages 290, 305; for
Markov chains 347; moment — 236;
for recurrent events 243; renewal 272;
ruin 288, 305; runs 265, 268, 278;
sequential sampling 302, 306; sums 215;
tails 219.

Genes and genotypes 92; distributions 94, 
101, 315, 333; inheritance 92, 204; 
mutations 224, 369; sex-linked 96. 

Genetics, cf. Chromosomes, Genes.

Geometric distribution 174 (generating
fct. 217; variance 181); composition
with binomial 223; exponential limit
220; holding times 219; limit in
Bose-Einstein statistics 59; special
applications (Dodge's plan 168, 188; family
size 100, 224; mortality distr. 278).

Goncarov, V. 206. 

Greenwood, J. A. 45, 336. 

Growth 277, 370, 374, 395. 

Guessing 66, 182. 

Gumbel, E. J. 113.

Hardy, G. H. 95, 162; Hardy’s law 95 
(for pairs of genes 102). 

Harris, T. E. 227. 

Hausdorff, F. 157, 162. 

Helly's theorem 233. 

Higher sums 339. 

Holding times, exponential 218, 375.

Hostinsky, B. 343. 

Hypergeometric distribution 33, 46, 167;
binomial approximation 47, 108; double
39, 187; normal appr. 146; Poisson
appr. 114; variance and mean 183.

Images, method of, 304. 

Independent events 86 (pairwise and 
mutually 88); experiments 90; random 
variables 169, 190; trials 88. 

Indistinguishable objects 11, 51.

Initials 44. 

Insect survivors 128, 222. 

Intersection of events 13. 

Inversions 205. 

Iterated logarithm for Bernoulli trials
157; generalized 163; number
theoretical interpretation 162.

Kac, M. 45, 252, 313, 358. 

Kakutani, S. 346. 

Kelvin, Lord 304. 

Kendall, D. G. 374. 

Kendall, M. G. and B. Smith 26. 

Key problem 187. 

Khintchine, A. 147, 157, 191. 

Kolmogorov, A. 6, 161, 293, 343, 378, 
390; Chapman-Kolmogorov eqns. 338, 
341, 387, 396; — criterion 207
(converse 211); — differential eqns. 389,
390; — inequality 184. 

Koopman, B. O. 4. 

Lagrange, J. L. 292. 

Laplace, P. S. 62, 84, 133, 212, 236, 
344; (law of succession 84); DeMoivre 
— limit theorem 137, 139. 

Largest observation 175, 187. 

Larvae 128, 222. 

Latent root 350. 

Law of the arc sine 252, 257; counterpart 
262. 

Law of the iterated logarithm 157, 162, 
163. 

Law of large numbers: Bernoulli trials
(weak 141; strong 156; number
theoretical interpretations 161);
dependent variables 209; indep. random
variables (identically distr. 191; infinite
expectation 200; arbitrary variables
202, 209; strong 207, 210, 211); Markov
processes 342; permutations 205.


Lawrence, T. E. 6. 

Lea, D. E. 76, 120. 

Leads, distribution of 250. 

Lefthanders 125. 

Lévy, P. 252, 253.

Limit theorems: arc sine law 252, 262;
average recurrence times 253;
Bose-Einstein statistics 59; continuity
theorem 232, 278; distributions with
infinite moments 200, 252; geometric
distr. 220; matching distr. 67, 77;
occupancy 72; Pascal 221, 233; Polya
128, 395; sampling 76; uniform distr.
236. [Cf. Approximation, Central
limit theorem, Law of large numbers,
Markov chains, Normal approximation,
Steady state.]

Lindeberg, J. W. 192; — condition 202.

Littlewood, J. E. 162.

Ljapunov, A. 192; — condition 209.

Loss, coefficient of 383. 

Lotka, A. J. 100, 224. 

Lunch-counter example 56, 58. 

Lundberg, O. 395. 

McCrea, W. H. 297, 300. 

Machine servicing 376, 379. 

Malécot, G. 315.

Margenau, H. 54. 

Marginal distribution 166. 

Markov, A. 192, 307. 

Markov chains 309; absorption 332,
358; associated with continuous time
processes 373, 391; classification 320,
345; closed sets 318; decomposable
316, 320; ergodic properties 324, 330,
345; finite 324, 342; general 337;
irreducible 318; limit theorems 342;
periodic 316, 329, 351; probabilities
(absolute 318, initial 309, inverse 341,
stationary 328, 331, transition 309,
317, 338); recurrence times 320, 324;
reversible 342; superposition of 340.

Markov process 337; with continuous 
time parameter 386; (waiting times 
219). 

Markov property 338, 387. 

Match box problem 108, 176. 

Matching of cards 62, 66; multiple 76; 
variance 181. 





Mating 93, 101; (as Markov chain
problem 315, 345, 360).

Matrix: notation 103, 317; partitioned
316, 323, 359; stochastic 309 (doubly
327, 336, 345; canonical decomposition
347, 350; application to
non-stochastic matrices 343, 359).

Maximum likelihood 38. 

Maxwell-Boltzmann statistics 53. 

Mean = expectation 171 (in terms of 
generating fct. 213); normal distr. 
133; number of successes 138. 

Measure in product spaces 91;
convergence in measure 162, 207.

Median 36. 

Mendel, G. 92. 

Méré’s paradox 45.

Mises, R. von 6, 72, 152, 157, 266, 278.

Misprints 113, 126. 

Mixed populations 82, 237. 

Molecules, long-chain 190. 

Molina, E. C. 113, 143. 

Moment generating function 236. 

Moments 177. 

Montmort, P. R. 62. 

Mood, A. 146. 

Morse alphabet 44. 

Mortality, cf. Renewal.

Moving averages 339, 340, 346. 

Multinomial coefficients 32. 

Multinomial distribution 124, 167;
(maximal term 126, 146).

Multiplets 24.

Murphy, G. M. 54.

Mutations 224, 368. 

(n)_r 25.

National Bureau of Standards 28, 
130. 

Negative binomial distribution 218. 

Neighbors, unlike 56. 

Neyman, J. 123. 

Non-Markovian processes 338.

Normal approximation to binomial distr.
(individual terms 135, 146, central
part 137, 139, tails 144, 147);
combinatorial runs 146; hypergeometric
distr. 146; permutations 205; Poisson
distr. 143, 146, 193; success runs 205.
[Cf. Central limit theorem.]


Normal density and distribution 129; 
tables 132; tails 131, 145. 

Normal numbers 163. 

Normalized variables 179. 

Nuclear chain reaction 223. 

Null state 321. 

Number theoretical interpretations 161. 

Occupancy problem 54, 69, 174; limit 
theorem 72; treatment by Markov 
chains 313, 354. 

Optional stopping 140, 190, 197. 

Ornstein, L. S. 293. 

p(k; λ) 111.

π, decimals of, 124.

Pairs 23. 

Palm, C. 377, 379, 383. 

Parapsychology 45, 336. 

Parking tickets 45. 

Partial fraction expansions 227, 237; for 
Markov chains 348; numerical examples
230, 266, 276; recurrent events
261; renewal 276; success runs 266 
(numerical estimate 269). 

Particles: in chain reactions 223; random 
walk 279; splitting — 365, 369, 374; 
statistics of — 53. 

Particular solutions, method of 283, 289,
302. 

Partitioning of matrices 316, 323. 

Partitions, combinatorial 30.

Pascal distribution 174, 217, 237; in
game of billiards 236; Poisson limits
221, 233; and Polya distr. 218;
reciprocal 190; variance 181, 217; and
waiting times 221.

Pearson, K. 127, 204. 

Pedestrians as non-Markovian process 
339. 

Periods: Markov chains 316, 321;
recurrent events 241, 244; renewal theory
273.

Permutations 90, 205. 

Petersburg paradox 7, 199, 255. 

Petri plate 122. 

Phase space 12. 

Poisson, S. D. 110. 

Poisson distribution 115; compound
237, 391 (combined with binomial
128, 221, 344); convolutions 127, 216;
empirical examples 111, 119, 125, 225;
generating fct. 216; holding times
220; limiting distr. for (binomial 110,
115; fluctuation 344; hypergeometric
114; matching 67; multinomial 127;
occupancy 59, 72; Pascal 221, 233;
Poisson trials 233; Polya 128; long
runs 278; traffic problem 340; trunking
378, 396); mean 174, 216; multiple 127;
normal approximation 143, 146, 193;
spatial 118; time dependent 117, 364;
variance 178 (216).

Poisson process 364, 386; compound 
376, 391, 395. 

Poisson traffic 376. 

Poisson trials 189, 233. 

Poker 9, 46; special problems 31, 76, 
126. 

Pollard, H. 244.

Polya, G. 83, 174, 297; — distributions 
128 (mean 188; as Pascal distr. 218; 
limiting forms 128, 210, 395); — process
395; — urn scheme 83, 101 (as
non-Markovian process 338). 

Polymer molecules 190. 

Population theory, cf. Birth and death
process, Family, Genes, Renewal.

Power-supply problems 108, 384, 396. 

Probability 17 (in product spaces 91); 
absolute 79 (for Markov chains 318, 
338); — of causes 85; compound — 
81; conditional 78 (random variables
168; Markov chains 307; processes 
338, 387); — distributions 164; initial 
307; inverse 341. 

Product measure and space 91. 

Quality control 34, 56, 125; (Bartky’s 
sampling 281, 314; Dodge’s 168, 188). 

Rademacher, H. 6. 

Radiation effects 45, 76, 120, 222. 

Radioactive disintegrations 119, 368. 

Railroad problem 139. 

Raisins, distribution of 113, 118, 126,
237. 

Random chains 190. 

Random choice 25. 

Random digits: counts 26, 27, 55 (π and
e 29, 124); k distinct among n — 27,
70; distr. (binomial and Poisson 54, 
112; normal 141, 142); frequency of 
decimals 161; as occupancy problem 
54; special problems 44, 125. 

Random mating 93, 101. 

Random variables 164; generalized ( = 
improper) 242; integral valued 212; 
Markovian 337; normalized 179;
time-dependent 363, 387.

Random walk 279, 304; absorbing
barriers 279 (generating fct. 288, 304,
explicit expression 292, 304, as Markov
chain 310, 333); cyclical — 311,
353; diffusion 293, 304; first passages
280, 292; generalized (= sequential
sampling) 281, 300, 306, 333; more
dimensional 281, 297, 306, 345;
reflecting barriers 280, 304, 305, 311,
326 (explicit expressions 355, more
dimensions 345); renewal method 305.

Randomness: of sequences 157; tests of 
— 56, 68. 

Recessive 92, 98, 102. 

Recurrence times 241; limit theorems 
248, 253; in Markov chains 320, 362; 
mean — 242; moments 262; in
random walks 280, 297, 305.

Recurrent events (patterns) 238;
classification 241; criteria 244.

Recurrent states 320. 

Reduced number of successes 138. 

Reflecting barriers 280, 304, 305, 311;
explicit solution 355; in plane 345;
stationary distr. 326.

Rejection 281, 300, 314. 

Rencontre 62. 

Renewal: coefficients 276; equation 272; 
method in random walks 305; of 
populations 275. 

Repairs of machines 379, 382. 

Repeated trials 88 (product spaces 91); 
infinite sequences 149, 282, 307;
random variables representing — 169.

Replacements 275. [Cf. Sampling.] 

Retrospective equations = backward 
equations. 

Reversible Markov chains 342. 

Robbins, H. 223. 

Romanovsky, V. 343. 





Ruin problem 279, 282; generating fct. 
288, 292, 304; in Markov chains 332; 
numerical illustration 287; sequential 
sampling 281, 300, 306. 

Runs, combinatorial 56, 59; mean 188; 
normal approximation 146. 

Runs in repeated trials 239, 264; in game 
of billiards 235, 236; generating fct. 
265, 268; as Markov chain problem 
310, 318; normal distr. 265; partial 
fraction method 266 (error estimate 
269, special cases 229, 278); Poisson 
distr. of long runs 278; r successes
before n failures, etc. 149, 163, 268, 278;
special problems 187, 235. 

Rutherford-Chadwick-Ellis 119.

S, for success 104. 

Sample average 193. 

Sample point 10, 13. 

Sample size, required 142, 146, 192.

Sample space 5, 10, 12; discrete 16; in 
terms of random variables 169;
repeated trials 89 (infinite sequences
149). 

Sampling with and without replacements
24, 91 (comparison 47, 108);
distinct elements in samples 63, 76,
174, 181, 188, 235; of fish 37;
inspection — 34, 125 (Bartky’s scheme 281,
314; Dodge’s 168, 188); largest
observation 175, 187; required size 142,
146, 192; sequential 281, 300, 306,
313; stratified 188.

Schroedinger, E. 223.

Seeds: Poisson distr. 118; survival 224. 

Selection, genetic 93, 99, 102, 224.

Selections, combinatorial 30.

Self-renewing aggregates 275. 

Senator problem 32, 34. 

Sequential sampling 281, 300, 306, 313. 

Servicing factor 380; — problems 377, 
379; (power supply 108, 384, 396). 

Sets, closed (in Markov chains) 318; — 
cylindrical 91. [Cf. Events.] 

Seven-way lamps 24.

Sex distribution in families 79, 81, 86, 
100, 125, 222. 

Sex-linked 96, 98, 102. 

Shewhart, W. A. 56. 


Shoe problems 46, 75. 

Shooting 80, 125. 

Shuffling 335; composite — 340. 

Smirnov, N. 147. 

Stable distribution: age 276, 278;
genotype 94, 98, 102.

Stakes, effect of changing 285. 

Standard deviation 178; (normal distr. 
133; successes 138). 

States in Markov chains 309;
classification 320 (unessential — 344).

Stationary distribution for aperiodic 
chains 328; periodic 331. [Cf. Stable 
distribution.] 

Steady state: age distr. 276, 278; birth 
and death process 373; genotype distr. 
95; power supply 384; servicing
problems 381, 382; waiting lines 379.

Steinhaus, H. 6, 108. 

Stirling's formula 41, 134; alternative — 
50. 

Stochastic matrix 309; doubly — 327, 
336, 345; generalization 343, 359. 

Stochastic process 337, 363; stationary 
396. [Cf. Differential equations.] 

Stratified sampling 188. 

Stuart, E. E. 45, 336. 

Success 104; reduced number 138. [Cf. 
Runs.] 

Succession, law of 84.

Summation formula 61. 

Survival, birth and death process 374, 
395; insect eggs, etc. 128; family 
names, genes 224. 

Systems of gambling 151, 282. 

Tauberian theorem 259, 272. 

Telephone statistics 122, 234, 365, 376; 
— trunking 143, 377; (holding times 
219). 

Tests, of grouping 56; sequential 127;
statistical 45. 

Theta functions 304.

Thoday, J. M. 76, 120. 

Thorndike, E. 122. 

Ties: billiards 235; multiple coin games 
246, 261; dice 247. 

Time-homogeneous process 386. 

Tippett, L. H. C. 26. 

Traffic, incoming 376.





Traffic problem 339. 

Transient states 320; ergodic properties 
332; finite chains 358.

Transition probabilities 309; general 
process 337; higher — 317; stationary 
338; stochastic processes 385, 386. 

Trials, independent 89 (as Markov chain
310); repeated 88, 91 (— and random 
variables 169; infinite sequences 149, 
282, 307). 

Truncation method 195, 200, 202, 210,
212.

Trunking problems 143, 377, 396. 

Uhlenbeck, G. E. 293, 313. 

Uniform distribution 236. 

Union of events 13. 

Urn schemes as Markov chains 308; 
(Ehrenfest 312, 327, 345; Laplace 83, 
344; Polya 83, 101, 128). [Cf.
Sampling.]

Uspensky, J. V. 137, 140, 269. 


Variance 177 (from generating fct. 214);
normal distr. 133.

Veen, S. C. Van 235. 

Waiting lines 377, 378. 

Waiting times 218; (card drawing 35);
exponential — 375.

Wald, A. 127, 146, 197, 281, 300. 
Wang, Ming Chen 313. 

Welders problem 108, 384. 

Weldon's dice data 106. 

Whipple, F. J. W. 297, 300. 
Whitworth, W. A. 23. 

Wiener, N. 293. 

Wilks, S. S. 57. 

Wolfowitz, J. 146. 

Wright, S. 315.

X-rays, effect on cells 45, 76, 120, 222.
Yosida, K. 346. 

Yule, G. U. 369; — process 368, 394. 





